From: Karsten Wenzel
Subject: How to do fast matrix-multiplication?
Date: 
Message-ID: <dsa02u$qdm$1@localhost.localdomain>
Hi,

I need to operate on huge two-dimensional arrays.
For example, a matrix-multiplication.

For a first test, I tried to multiply two small
500x500 arrays (dimensions hard-coded, no
error checking or whatever):

(defun mult-array (a b)
   (let ((c (make-array '(500 500) :initial-element 0)))
     (dotimes (i 500)
       (dotimes (j 500)
         (dotimes (k 500)
           (setf (aref c i j) (+ (aref c i j)
                 (* (aref a i k) (aref b k j)))))))
     c))

(setf a (make-array '(500 500)))
(setf b (make-array '(500 500)))
(fill-array a) ; Not shown here
(fill-array b)
(mult-array a b)

This was slow. The algorithm isn't very smart, but
that's not the point.

OK, these arrays should only contain integers.
So I tried to define the arrays this way:

(make-array '(500 500) :element-type 'fixnum)

But I got no speed-up. I checked LispWorks and OpenMCL
and both are slower this way.

How do I make it really fast? What should I do better?
Are there faster ways to access elements than
aref?

Thank you in advance,
Karsten

From: Förster vom Silberwald
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139314843.439901.146000@g14g2000cwa.googlegroups.com>
Karsten Wenzel wrote:

> This was slow. The algorithm isn't very smart, but
> that's not the point.
>
> OK, these arrays should only contain integers.
> So I tried to define the arrays this way:

I cannot comment on your code since my Lisp is zero. However, if you
are really after some fast matrix-multiplication algorithms, you should
google around a bit. There exist techniques for this kind of
thing (loop unrolling, cache-size awareness, etc.).

A good idea would also be to have a look at some specialized libraries
and write a binding for your Lisp implementation.

Btw: profiling the code will also lead to some better insight. Then you
can quickly decide where most of the time is spent (is it floating-point
performance, or memory?).

Schneewittchen
From: Alexander Schmolck
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <yfshd7bfj7s.fsf@oc.ex.ac.uk>
Karsten Wenzel <·······@tfh-berlin.de> writes:

> Hi,
> 
> I need to operate on huge two-dimensional arrays.
> For example, a matrix-multiplication.

Assuming you want to do number crunching, for best performance you'll want to
wrap ATLAS. There's very little chance you'll beat ATLAS performance with
home-brewed matrix multiplication routines in any language, and if most of
your arrays are large the FFI overhead should be negligible. Have a look at
matlisp as a possibly-not-great starting point (it doesn't work with most
implementations, though).

> For a first test, I tried to multiply two small
> arrays 500x500 (dimension hard coded, no
> error checking or whatever):
> 
> (defun mult-array (a b)
>    (let ((c (make-array '(500 500))))
>      (dotimes (i 500)
>        (dotimes (j 500)
>          (dotimes (k 500)
>            (setf (aref c i j) (+ (aref c i j)
>                  (* (aref a i k) (aref b k j)))))))
>      c))
> 
> (setf a (make-array '(500 500)))
> (setf b (make-array '(500 500)))
> (fill-array a) ; Not shown here
> (fill-array b)
> (mult-array a b)
> 
> This was slow. The algorithm isn't very smart, but
> that's not the point.
> 
> OK, these arrays should only contain integers.
> So I tried to define the arrays this way:
> 
> (make-array '(500 500) :element-type 'fixnum)

See Edi's reply; also, once you (declare (optimize (speed 3))), CMUCL and SBCL
will give you hints about where to put declarations.

But integer matrix multiplication used to be one of the benchmarks in Doug
Bagley's language shootout, so you should be able to find a Lisp version that
does it reasonably well floating around on the web somewhere.

Are you sure you don't generally want floating point, though?
 
'as
From: Isaac Gouy
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139417840.682009.84710@g44g2000cwa.googlegroups.com>
Alexander Schmolck wrote:
> Karsten Wenzel <·······@tfh-berlin.de> writes:
>
> > Hi,
> >
> > I need to operate on huge two-dimensional arrays.
> > For example, a matrix-multiplication.
>
> Assuming you want to do number crunching, for best performance you'll want to
> wrap atlas. There's very little chance you'll beat atlas performance with
> home-brewn matrix multiplication etc. routines in any language and if most of
> your arrays are large the FFI overhead should be negligible. Have a look at
> matlisp as a possibly not great starting point (doesn't work with most
> implementations, though).
>
> > For a first test, I tried to multiply two small
> > arrays 500x500 (dimension hard coded, no
> > error checking or whatever):
> >
> > (defun mult-array (a b)
> >    (let ((c (make-array '(500 500))))
> >      (dotimes (i 500)
> >        (dotimes (j 500)
> >          (dotimes (k 500)
> >            (setf (aref c i j) (+ (aref c i j)
> >                  (* (aref a i k) (aref b k j)))))))
> >      c))
> >
> > (setf a (make-array '(500 500)))
> > (setf b (make-array '(500 500)))
> > (fill-array a) ; Not shown here
> > (fill-array b)
> > (mult-array a b)
> >
> > This was slow. The algorithm isn't very smart, but
> > that's not the point.
> >
> > OK, these arrays should only contain integers.
> > So I tried to define the arrays this way:
> >
> > (make-array '(500 500) :element-type 'fixnum)
>
> See Edi's reply, also once you (declare (optimize (speed 3))), cmucl and sbcl
> will also give you hints where to put declarations.
>
> But integer matrix multiplication used to be one of the benchmarks in doug
> bagley's language shootout, so you should be able to find a lisp version that
> does that reasonably well floating around on the web somewhere.


http://shootout.alioth.debian.org/old/benchmark.php?test=matrix&lang=all

>
> Are you sure you don't generally want floating point, though?
>  
> 'as
From: Edi Weitz
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <ur76fqt57.fsf@agharta.de>
On Tue, 07 Feb 2006 12:25:21 +0100, Karsten Wenzel <·······@tfh-berlin.de> wrote:

> I need to operate on huge two-dimensional arrays.  For example, a
> matrix-multiplication.
>
> For a first test, I tried to multiply two small arrays 500x500
> (dimension hard coded, no error checking or whatever):
>
> (defun mult-array (a b)
>    (let ((c (make-array '(500 500))))
>      (dotimes (i 500)
>        (dotimes (j 500)
>          (dotimes (k 500)
>            (setf (aref c i j) (+ (aref c i j)
>                  (* (aref a i k) (aref b k j)))))))
>      c))
>
> (setf a (make-array '(500 500)))
> (setf b (make-array '(500 500)))
> (fill-array a) ; Not shown here
> (fill-array b)
> (mult-array a b)
>
> This was slow. The algorithm isn't very smart, but that's not the
> point.
>
> OK, these arrays should only contain integers.  So I tried to define
> the arrays this way:
>
> (make-array '(500 500) :element-type 'fixnum)
>
> But I got no speed-up. I checked LispWorks and OpenMCL and both are
> slower this way.
>
> How do I make it really fast? What should I do better?  Are there
> faster ways to access elements than aref?

There are a couple of open questions here.  First, what does "This was
slow" really mean?  Do you have something to benchmark against, or
does it just /feel/ slow?

Second, as you haven't posted here before: Did you compile the code?
For implementations like LispWorks and OpenMCL that compile to native
code that'd make a big difference.

Apart from that, if this is really a bottleneck, you should add
declarations to your code, i.e. within MULT-ARRAY there should be a
DECLARE that declares the types of A, B, and C as accurately as
possible.  Declaring the element type in MAKE-ARRAY as you did above
doesn't make a difference.

You can also try various OPTIMIZE qualities, including
implementation-dependent ones - for LispWorks for example you should
check chapter nine of the LispWorks User Guide.

HTH,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Karsten Wenzel
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <dsc3bt$bgj$1@localhost.localdomain>
Edi Weitz wrote:
> There are a couple of open questions here.  First, what does "This was
> slow" really mean?  Do you have something to benchmark against, or
> does it just /feel/ slow?

Yes, I did the same thing in C on my Mac Mini: it took about
4 seconds in C, 8 seconds in Java, and 41 seconds with LispWorks
(compiled, not interpreted).

> Apart from that, if this is really a bottleneck, you should add
> declarations to your code, i.e. within MULT-ARRAY there should be a
> DECLARE that declares the types of A, B, and C as accurately as
> possible.  Declaring the element type in MAKE-ARRAY as you did above
> doesn't make a difference.
> 
> You can also try various OPTIMIZE qualities, including
> implementation-dependent ones - for LispWorks for example you should
> check chapter nine of the LispWorks User Guide.

Thanks a lot, I will try this.

Karsten
From: Edi Weitz
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <ufymuxn7v.fsf@agharta.de>
On Wed, 08 Feb 2006 07:33:33 +0100, Karsten Wenzel <·······@tfh-berlin.de> wrote:

> Edi Weitz wrote:
>
>> Apart from that, if this is really a bottleneck, you should add
>> declarations to your code, i.e. within MULT-ARRAY there should be a
>> DECLARE that declares the types of A, B, and C as accurately as
>> possible.  Declaring the element type in MAKE-ARRAY as you did
>> above doesn't make a difference.  You can also try various OPTIMIZE
>> qualities, including implementation-dependent ones - for LispWorks
>> for example you should check chapter nine of the LispWorks User
>> Guide.
>
> Thanks a lot, I will try this.

With the following (quite aggressive) declarations

  (defun mult-array (a b)
    (declare (optimize speed
                       (safety 0)
                       (space 0)
                       (debug 1)
                       (compilation-speed 0)))
    (let ((c (make-array '(500 500) :element-type 'fixnum :initial-element 0)))
      (declare (type (simple-array fixnum *) a b))
      (dotimes (i 500)
        (dotimes (j 500)
          (dotimes (k 500)
            (setf (aref c i j) (+ (aref c i j)
                                  (the fixnum (* (aref a i k) (aref b k j))))))))
      c))

CMUCL (19c on Linux x86) is sped up by a factor of 9 to 10 on my
machine - approx. 27.5 seconds vs. 2.9 seconds.

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Joe Marshall
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139419959.952443.99540@g43g2000cwa.googlegroups.com>
Edi Weitz wrote:
>
> With the following (quite aggressive) declarations
>
>   (defun mult-array (a b)
>     (declare (optimize speed
>                        (safety 0)
>                        (space 0)
>                        (debug 1)
>                        (compilation-speed 0)))
>     (let ((c (make-array '(500 500) :element-type 'fixnum)))
>       (declare (type (simple-array fixnum *) a b))
>       (dotimes (i 500)
>         (dotimes (j 500)
>           (dotimes (k 500)
>             (setf (aref c i j) (+ (aref c i j)
>                                   (the fixnum (* (aref a i k) (aref b k j))))))))
>       c))
>
> CMUCL (19c on Linux x86) is sped up by a factor of 9 to 10 on my
> machine - approx. 27.5 seconds vs. 2.9 seconds.

With some Common Lisp compilers I have gotten better performance by
declaring the array to be of element-type T rather than fixnum.  It
seems that some compilers upgrade `fixnum' to `signed-byte 32'.  This
is fine, except that the compiler now has to strip and replace the tag
when storing and fetching from the array (on a low-tag machine this
involves shifting).  By making the element-type be T, you avoid the
tagging operations and things run faster.  (You may have to play
around with the (setf aref) code, though.  Storing an object of type T
generally involves mucking with the write-barrier unless you can
persuade the system that you are not storing a pointer object.)
From: bradb
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139421022.766351.324950@z14g2000cwz.googlegroups.com>
> With some common lisp compilers I have gotten better performance by
> declaring the
> array to be of element-type T rather than fixnum.  It seems that some
> compiler upgrade `fixnum' to `signed-byte 32'.  This is fine, except
> that it now has to strip and replace the tag when storing and fetching

Does this mean that the "natural" integer representation (ie, the most
performant) for Lisp code should be fixnum?  Am I right in saying that
a fixnum should probably hold 29-30 bytes of data + tag bits?  I'm not
familiar with Lisp internals and, although it makes sense, I had
assumed that (unsigned-byte-32) would be the fastest int
representation.  Guess that's what I get coming from the C world.

Cheers
Brad
From: ···············@yahoo.com
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139425574.493958.324310@o13g2000cwo.googlegroups.com>
This is my impression, though I can't speak for all CL implementations.
 I remember the manuals for a Symbolics Lisp Machine ca. 1986 saying
this explicitly.  There was a section on functions with "32" in their
names.  At the top of the section it said, in effect, "Don't use these
because you think 32-bit integers will be faster.  They're slower."
From: Zach Beane
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <m3wtg57mo7.fsf@unnamed.xach.com>
"bradb" <··············@gmail.com> writes:

> Am I right in saying that a fixnum should probably hold 29-30 bytes
> of data + tag bits?

Assuming you mean "29-30 bits", the specification mandates only that
it is a supertype of (signed-byte 16). Some implementations have much
smaller than 30 bits available in their fixnums.
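
You can check what an implementation actually gives you at the REPL:

  (integer-length most-positive-fixnum)
  ;; => 29 in a Lisp with 30-bit signed fixnums, for example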

Zach
From: Thomas A. Russ
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <ymipslwf842.fsf@sevak.isi.edu>
Zach Beane <····@xach.com> writes:

> "bradb" <··············@gmail.com> writes:
> 
> > Am I right in saying that a fixnum should probably hold 29-30 bytes
> > of data + tag bits?
> 
> Assuming you mean "29-30 bits", the specification mandates only that
> it is a supertype of (signed-byte 16). Some implementations have much
> smaller than 30 bits available in their fixnums.

Quite so.  Actually, 29 bits seems to be quite popular, though some
implementations have lower values.

On Mac OS X:
  29 bits:  MCL, OpenMCL, SBCL, CMUCL
  24 bits:  CLISP

On Sparc Solaris (32bit)
  29 bits:  ACL

On Linux
  29 bits:  ACL, CMUCL

On Windows
  I seem to recall seeing one lisp implementation at only 16 bits,
  but I'm not sure exactly which one and I can't try it out right now.
  

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Rob Warnock
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <0audnbPqPoKrsXHeRVn-jA@speakeasy.net>
Thomas A. Russ <···@sevak.isi.edu> wrote:
+---------------
| Zach Beane <····@xach.com> writes:
| > Assuming you mean "29-30 bits", the specification mandates only that
| > it is a supertype of (signed-byte 16). Some implementations have much
| > smaller than 30 bits available in their fixnums.
| 
| Quite so.  Actually 29 bits seems to be quite popular, but with some
| lower values.
| 
| On Mac OS X:
|   29 bits:  MCL, OpenMCL, SBCL, CMUCL
...
| On Linux
|   29 bits:  ACL, CMUCL
+---------------

Actually, the 32-bit versions of CMUCL (and I would suspect SBCL as well) 
have *30*-bit (signed) fixnums. Yes, the object representation does use
the low three bits as tags, but *two* of the eight encodings are assigned
to FIXNUM[1], so there's an effective 30-bit (signed) range:

    cmu> (log (- most-positive-fixnum most-negative-fixnum -1) 2)

    30.0f0
    cmu> 


-Rob

[1] Lowtag #b000 is EVEN-FIXNUM-TYPE and lowtag #b100 is ODD-FIXNUM-TYPE.
    Thus, in the absence of overflow, native 32-bit machine arithmetic
    can be used to add or subtract FIXNUMs.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: bradb
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139553657.351258.77980@f14g2000cwb.googlegroups.com>
What is the advantage of an implementation using low bits as tags?  I
get the feeling that it ought to be less efficient than using the
higher bits.  But I'm also sure that smarter people than me have
thought about it harder than I have, and that low bits must be no less
efficient than high bits ... so what's the history?

Cheers
Brad
From: Rob Warnock
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o-OdncbcttKk33HeRVn-rg@speakeasy.net>
bradb <··············@gmail.com> wrote:
+---------------
| What is the advantage of an implementation using low bits as tags?  I
| get the feeling that it ought to be less efficient than using the
| higher bits.  But I'm also sure that smarter people than me have
| thought about it harder than I have, and that low bits must be no less
| efficient than high bits ... so what's the history?
+---------------

The best overview of the tradeoffs that I know of is here:

    ftp://ftp.cs.indiana.edu/pub/scheme-repository/doc/pubs/typeinfo.ps.gz
    "Representing Type Information in Dynamically Typed Languages",
    by David Gudeman (Oct. 1993).

But the essence of the lowtag choice is this: Since most heap objects
are as big as a cons or larger, and since on 32-bit byte-addressed
machines a cons is typically 8 bytes, you can choose to force 8-byte
alignment of heap objects without very much wastage. Given *that*, a
3-bit lowtag system is almost "free": heap objects can be accessed
directly using the descriptor as a native machine pointer without
having to mask off the lowtag field, provided only that a small
(type-dependent) constant can be subtracted from the descriptor by the
hardware during the process of doing address arithmetic [which most
instruction sets can do]. E.g., in 32-bit CMUCL the lowtag for a cons
is 3, so [after type checks] in C the operation (CDR foo) is simply
"((struct cons *)((u32)(foo) - 3))->cdr".


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Brian Downing
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <28XGf.546160$084.521033@attbi_s22>
In article <·······················@f14g2000cwb.googlegroups.com>,
bradb <··············@gmail.com> wrote:
> What is the advantage of an implementation using low bits as tags?  I
> get the feeling that it ought to be less efficient than using the
> higher bits.  But I'm also sure that smarter people than me have
> thought about it harder than I have, and that low bits must be no less
> efficient than high bits ... so what's the history?

I'm not sure historically... but I can make a few guesses.

Tradeoffs for high bits:
* No manipulation required for add or multiply.
* Carry flag doesn't work - have to check high bits instead.
* Accessing the full address range of memory requires shifts, which
  requires more instructions and means that you'll have untagged values
  sitting around in registers, which can be bad for your GC.

Tradeoffs for low bits (if fixnum low bits are 0):
* Add is still just ADD A, B.
* To multiply you need to shift only one argument right.
* Carry flag works normally.
* Pointers to the whole address range can be accessed (without
  manipulating the pointer) with indexing.  For example, if the CONS tag
  is 3, getting at the car is [REG-3] and the cdr is [REG+1].

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Duane Rettig
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o0k6c21eiw.fsf@franz.com>
Brian Downing <·············@lavos.net> writes:

> In article <·······················@f14g2000cwb.googlegroups.com>,
> bradb <··············@gmail.com> wrote:
>> What is the advantage of an implementation using low bits as tags?  I
>> get the feeling that it ought to be less efficient than using the
>> higher bits.  But I'm also sure that smarter people than me have
>> thought about it harder than I have, and that low bits must be no less
>> efficient than high bits ... so what's the history?
>
> I'm not sure historically... but I can make a few guesses.
>
> Tradeoffs for high bits:
> * No manipulation required for add or multiply.
> * Carry flag doesn't work - have to check high bits instead.
> * Accessing the full address range of memory requires shifts, which
>   requires more instructions and means that you'll have untagged values
>   sitting around in registers, which can be bad for your GC.
>
> Tradeoffs for low bits (if fixnum low bits are 0):
> * Add is still just ADD A, B.
> * To multiply you need to shift only one argument right.
> * Carry flag works normally.
> * Pointers to the whole address range can be accessed (without
>   manipulating the pointer) with indexing.  For example, if the CONS tag
>   is 3, getting at the car is [REG-3] and the cdr is [REG+1].

Also, if there are just two low tag bits on fixnums (e.g. tag values 0
and 4 are used for even and odd fixnums, respectively), then accesses
to arrays with natural-sized elements (i.e. 32 bits) need no shifting.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Carl Shapiro
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <ouyoe1faa9w.fsf@panix3.panix.com>
···@sevak.isi.edu (Thomas A. Russ) writes:

> On Windows
>   I seem to recall seeing one lisp implementation at only 16 bits,
>   but I'm not sure exactly which one and I can't try it out right now.

Corman Common Lisp uses the same tag scheme as the 68K MCL.  Both of
those Lisps have 28 bit (29 bits, if you count the sign bit) fixnums.
From: Duane Rettig
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o0oe1e1equ.fsf@franz.com>
···@sevak.isi.edu (Thomas A. Russ) writes:

> Zach Beane <····@xach.com> writes:
>
>> "bradb" <··············@gmail.com> writes:
>> 
>> > Am I right in saying that a fixnum should probably hold 29-30 bytes
>> > of data + tag bits?
>> 
>> Assuming you mean "29-30 bits", the specification mandates only that
>> it is a supertype of (signed-byte 16). Some implementations have much
>> smaller than 30 bits available in their fixnums.
>
> Quite so.  Actually 29 bits seems to be quite popular, but with some
> lower values.
>
> On Mac OS X:
>   29 bits:  MCL, OpenMCL, SBCL, CMUCL
>   24 bits:  CLISP
>
> On Sparc Solaris (32bit)
>   29 bits:  ACL
>
> On Linux
>   29 bits:  ACL, CMUCL
>
> On Windows
>   I seem to recall seeing one lisp implementation at only 16 bits,
>   but I'm not sure exactly which one and I can't try it out right now.

All 32-bit Allegro CL implementations have 30 (signed) bits for fixnums
(i.e. 29 bits plus sign).  Our 64-bit implementations all have 61
(signed) bits for fixnums (i.e., 60 bits plus sign).

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Duane Rettig
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o0d5hx3shk.fsf@franz.com>
"bradb" <··············@gmail.com> writes:

>> With some common lisp compilers I have gotten better performance by
>> declaring the
>> array to be of element-type T rather than fixnum.  It seems that some
>> compiler upgrade `fixnum' to `signed-byte 32'.  This is fine, except
>> that it now has to strip and replace the tag when storing and fetching
>
> Does this mean that the "natural" integer representation (ie, the most
> performant) for Lisp code should be fixnum?  Am I right in saying that
> a fixnum should probably hold 29-30 bytes of data + tag bits?  I'm not
> familiar with Lisp internals and, although it makes sense, I had
> assumed that (unsigned-byte-32) would be the fastest int
> representation.  Guess that's what I get coming from the C world.

Often this is the case, but 29-30 bits assumes a 32-bit lisp.  If
you move to a 64-bit lisp, the fixnum size can be much larger, say,
60 bits plus sign...

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Carl Taylor
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1UqGf.358622$qk4.107074@bgtnsc05-news.ops.worldnet.att.net>
"Joe Marshall" <··········@gmail.com> wrote in message 
····························@g43g2000cwa.googlegroups.com...

> With some common lisp compilers I have gotten better performance by
> declaring the
> array to be of element-type T rather than fixnum.  It seems that some
> compiler upgrade `fixnum' to `signed-byte 32'.  This is fine, except
> that it now has to strip and replace the tag when storing and fetching
> from the array (on a low-tag machine this involves shifting).  By
> making the element-type be T, you avoid the tagging operations and
> things run faster.

This is apparently the situation in LW. Type 'fixnum is routinely changed to 
'(signed-byte 32).

I wrote a program to calculate matrix determinants and discovered I could 
shave significant time off the processing for order 9, 10, and 11 matrices 
by always forcing the input matrix argument to have element type T.  I use 
<make-array> with the :displaced-to option in various places, and by doing 
this it's possible to eliminate the :element-type (array-element-type 
x-array) keyword argument on all these calls.
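
(For illustration - a hypothetical sketch, not my actual code: with an
element-type T array, a displaced view can simply take the default
element type,

  (let* ((x (make-array '(9 9)))    ; element-type T by default
         (row (make-array 9 :displaced-to x
                            :displaced-index-offset (* 3 9))))
    row)                            ; a view of row 3 of X

whereas a specialized array would need the explicit
:element-type (array-element-type x) argument.)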

I had thought it was eliminating all the calls to <array-element-type> that 
helped with the speed, but maybe it's the tag stripping/replacing you 
mention.

Carl Taylor 
From: Duane Rettig
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o08xsl3s0y.fsf@franz.com>
"Carl Taylor" <··········@att.net> writes:

> "Joe Marshall" <··········@gmail.com> wrote in message
> ····························@g43g2000cwa.googlegroups.com...
>
>> With some common lisp compilers I have gotten better performance by
>> declaring the
>> array to be of element-type T rather than fixnum.  It seems that some
>> compiler upgrade `fixnum' to `signed-byte 32'.  This is fine, except
>> that it now has to strip and replace the tag when storing and fetching
>> from the array (on a low-tag machine this involves shifting).  By
>> making the element-type be T, you avoid the tagging operations and
>> things run faster.
>
> This is apparently the situation in LW. Type 'fixnum is routinely
> changed to '(signed-byte 32).

Allegro CL's history on fixnum arrays is spotty.  We started out with
one originally, but some stupid developer (:-) decided to eliminate it,
because it was not much different than an array of type T, and it didn't
really follow the pattern of any of the other specialized arrays, which
were all geared toward matching foreign data sizes.  However, I've seen
the error in this, and since version 7.0 we've brought the fixnum array
back.  Why?  Because when you declare an array to be of type fixnum,
then the only access an aref can produce is a fixnum, and one needn't
say (the fixnum (aref a n)) - it can just be (aref a n) and the type
propagation should be able to infer that the value must be a fixnum.
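
For example (a tiny hypothetical illustration, not Allegro-specific):

  (defun aref-sum (a n)
    (declare (type (simple-array fixnum (*)) a)
             (fixnum n))
    (let ((s 0))
      (declare (fixnum s))
      (dotimes (i n s)
        ;; (aref a i) is already known to be a fixnum here, so
        ;; no (the fixnum (aref a i)) wrapper is needed
        (setq s (the fixnum (+ s (aref a i)))))))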

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Edi Weitz
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <uslqt7njy.fsf@agharta.de>
On 8 Feb 2006 09:32:39 -0800, "Joe Marshall" <··········@gmail.com> wrote:

> Edi Weitz wrote:
>>
>> With the following (quite aggressive) declarations
>>
>>   (defun mult-array (a b)
>>     (declare (optimize speed
>>                        (safety 0)
>>                        (space 0)
>>                        (debug 1)
>>                        (compilation-speed 0)))
>>     (let ((c (make-array '(500 500) :element-type 'fixnum)))
>>       (declare (type (simple-array fixnum *) a b))
>>       (dotimes (i 500)
>>         (dotimes (j 500)
>>           (dotimes (k 500)
>>             (setf (aref c i j) (+ (aref c i j)
>>                                   (the fixnum (* (aref a i k) (aref b k j))))))))
>>       c))
>>
>> CMUCL (19c on Linux x86) is sped up by a factor of 9 to 10 on my
>> machine - approx. 27.5 seconds vs. 2.9 seconds.
>
> With some common lisp compilers I have gotten better performance by
> declaring the array to be of element-type T rather than fixnum.  It
> seems that some compiler upgrade `fixnum' to `signed-byte 32'.  This
> is fine, except that it now has to strip and replace the tag when
> storing and fetching from the array (on a low-tag machine this
> involves shifting).  By making the element-type be T, you avoid the
> tagging operations and things run faster.  (You may have to play
> around with the (setf aref) code, though.  Storing an object of type
> T generally involves mucking with the write-barrier unless you can
> persuade the system that you are not storing a pointer object.)

Interesting, thanks.

Which reminds me that I've seen a strange effect.  If I change the

       (declare (type (simple-array fixnum *) a b))

line above to

       (declare (type (simple-array fixnum *) a b c))

the code will run significantly (about 60%) slower.  Any Python
experts out there who know what's the reason for this?  Is this
related to the effect that Joe describes?

Thanks,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Christophe Rhodes
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <sqhd79ybkk.fsf@cam.ac.uk>
Edi Weitz <········@agharta.de> writes:

>>[Edi]
>>>   (defun mult-array (a b)
>>>     (declare (optimize speed
>>>                        (safety 0)
>>>                        (space 0)
>>>                        (debug 1)
>>>                        (compilation-speed 0)))
>>>     (let ((c (make-array '(500 500) :element-type 'fixnum)))
>>>       (declare (type (simple-array fixnum *) a b))
> Which reminds me that I've seen a strange effect.  If I change the
>
>        (declare (type (simple-array fixnum *) a b))
>
> line above to
>
>        (declare (type (simple-array fixnum *) a b c))
>
> the code will run significantly (about 60%) slower.  Any Python
> experts out there who know what's the reason for this?  Is this
> related to the effect that Joe describes?

Well, there's something odd about your code, to say the least.  As
you've written it, your type declaration is a free declaration for A
and B -- it does not refer to bindings.  The declaration for C, if
added, makes a bound declaration, but of course it should also be a
no-op declaration, because Python is quite capable of deriving the
type of C more precisely than your declaration.

What happens if you make the type declaration for A and B a bound
declaration (by putting it at the declarations for the DEFUN rather
than those for the LET)?  You may possibly want to see what happens if
you declare the array dimensions, too, at least to the point of
declaring that they are two-dimensional [ (* *) ] and maybe that A is
(500 *) and B is (* 500)?
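
I.e., something like this (a sketch only, with the dimension
declarations suggested above):

  (defun mult-array (a b)
    (declare (optimize speed (safety 0) (space 0)
                       (debug 1) (compilation-speed 0))
             ;; bound declaration at the DEFUN, with dimensions
             (type (simple-array fixnum (500 500)) a b))
    (let ((c (make-array '(500 500) :element-type 'fixnum
                         :initial-element 0)))
      (dotimes (i 500 c)
        (dotimes (j 500)
          (dotimes (k 500)
            (setf (aref c i j)
                  (+ (aref c i j)
                     (the fixnum (* (aref a i k) (aref b k j))))))))))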

(I think that, in CMUCL, Joe's point about write barriers does not
apply in this case; CMUCL has an unboxed specialization for FIXNUM
arrays.)

Christophe
From: Edi Weitz
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <uzml1sooi.fsf@agharta.de>
On Wed, 08 Feb 2006 18:01:15 +0000, Christophe Rhodes <·····@cam.ac.uk> wrote:

> Well, there's something odd about your code, to say the least.  As
> you've written it, your type declaration is a free declaration for A
> and B -- it does not refer to bindings.

Ah, I didn't know that.  I thought that such a declaration would apply
to the outermost corresponding binding.  Well, it's never too late to
learn something new... :)

Can you name chapter and verse in the CLHS where this is explained or
does one have to derive this somehow?

> The declaration for C, if added, makes a bound declaration, but of
> course it should also be a no-op declaration, because Python is
> quite capable of deriving the type of C more precisely than your
> declaration.

OK.

> What happens if you make the type declaration for A and B a bound
> declaration (by putting it at the declarations for the DEFUN rather
> than those for the LET)?

No noticeable difference.

> You may possibly want to see what happens if you declare the array
> dimensions, too, at least to the point of declaring that they are
> two-dimensional [ (* *) ] and maybe that A is (500 *) and B is (*
> 500)?

 (* *):                    no difference.

 A (500 *) and B (* 500):  down from ca. 2.95 seconds to 2.75 seconds

 both (500 500):           further down to ca. 2.50 seconds

Thanks,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Christophe Rhodes
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <sqslqthd8g.fsf@cam.ac.uk>
Edi Weitz <········@agharta.de> writes:

> On Wed, 08 Feb 2006 18:01:15 +0000, Christophe Rhodes <·····@cam.ac.uk> wrote:
>
>> Well, there's something odd about your code, to say the least.  As
>> you've written it, your type declaration is a free declaration for A
>> and B -- it does not refer to bindings.
>
> Ah, I didn't know that.  I thought that such a declaration would apply
> to the outermost corresponding binding.  Well, it's never too late to
> learn something new... :)
>
> Can you name chapter and verse in the CLHS where this is explained or
> does one have to derive this somehow?

Well, there's the glossary entries for "free declaration" and "bound
declaration".  The only declarations that have any defined semantic
effect in CLHS are special and notinline; both have examples of when a
free and bound declaration have different meanings.  For instance:

(defun foo1 (x)
  (declare (special x)) ; binds X with dynamic scope
  (bar)) ; if BAR references special X, will see FOO1's binding

(defun foo2 (x) ; binds X with lexical scope
  (locally 
    (declare (special x))
    (print x) ; refers to dynamic X
    (bar))) ; sees special X (not bound by FOO2)

(there's an even nastier example in the CLHS somewhere, but I can't
find it right now).

As for type declarations, I'm not sure of the effect of free
declarations in CMUCL-style declarations-are-assertions mode.  A bound
declaration inserts a type check (in safeish code, anyway); a free one
might just be believed implicitly.  Not sure.

Christophe
From: Juho Snellman
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <slrndukh88.6jq.jsnell@sbz-30.cs.Helsinki.FI>
Christophe Rhodes <·····@cam.ac.uk> wrote:
[ bound vs free declarations ]
> (there's an even nastier example in the CLHS somewhere, but I can't
> find it right now).

CLHS 3.3.4.1?

-- 
Juho Snellman
From: Christophe Rhodes
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <sqoe1hhci0.fsf@cam.ac.uk>
Juho Snellman <······@iki.fi> writes:

> Christophe Rhodes <·····@cam.ac.uk> wrote:
> [ bound vs free declarations ]
>> (there's an even nastier example in the CLHS somewhere, but I can't
>> find it right now).
>
> CLHS 3.3.4.1?

That's the one, yes.

Christophe
From: Duane Rettig
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <o0hd793slk.fsf@franz.com>
Christophe Rhodes <·····@cam.ac.uk> writes:

> Well, there's something odd about your code, to say the least.  As
> you've written it, your type declaration is a free declaration for A
> and B -- it does not refer to bindings. 

No, free declarations, when obeyed, affect references to the names within
the form in which the declaration appears.  See the fourth paragraph of 3.3.4.
Edi's code is perfectly fine, because all of the references to a and b
are within the scope of the free declaration.  However, since no
compiler is required to obey any but a few declarations, it can be
selective about them as well.  So his code is fine as is, but may not
be as optimized in one implementation as in another.

Allegro CL used to ignore free declarations (except ones it was required
to obey, like special) until I added the Environment Access module.
That made them almost trivial to add (I say _almost_ because I had to get
rid of some of the special-casing for specials and other free declarations
that had to be handled).  Now, the environments module just establishes
"contours" where the declarations are set up, and the older typing info
(if any) becomes unshadowed for the rest of the scope of the binding.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Karsten Wenzel
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <dsellv$bj$1@localhost.localdomain>
Edi Weitz wrote:
> With the following (quite aggressive) declarations
> 
>   (defun mult-array (a b)
>     (declare (optimize speed
>                        (safety 0)
>                        (space 0)
>                        (debug 1)
>                        (compilation-speed 0)))
>     (let ((c (make-array '(500 500) :element-type 'fixnum)))
>       (declare (type (simple-array fixnum *) a b))
>       (dotimes (i 500)
>         (dotimes (j 500)
>           (dotimes (k 500)
>             (setf (aref c i j) (+ (aref c i j)
>                                   (the fixnum (* (aref a i k) (aref b k j))))))))
>       c))

I tried this on my Mac Mini and I got no speedup at
all for OpenMCL and LispWorks Personal, but it's about
80-90% faster with CMUCL.  Absolute numbers: OpenMCL 77 seconds,
LW Personal 41 seconds, CMUCL 5 seconds.  The
(unoptimized, pointer-free) C program took
about 4 seconds.

Interesting. When I need something fast in the future,
I will choose CMUCL ;-)

Karsten
From: Edi Weitz
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <u64nothxi.fsf@agharta.de>
On Thu, 09 Feb 2006 06:58:23 +0100, Karsten Wenzel <·······@tfh-berlin.de> wrote:

> I tried this on my Mac Mini and I got no speedup at all for OpenMCL
> and LispWorks Personal

Yep, low-level optimization is usually an implementation-dependent
task.

> but its about 80-90% faster with CMUCL. Absolute: OpenMCL 77
> seconds, LW Personal 41 seconds and CMUCL 5 seconds. The
> (unoptimized, without pointers) C program took about 4 seconds.
>
> Interesting. When I need something fast in future, I will choose
> CMUCL ;-)

Python (the compiler used by CMUCL, SBCL, and Scieneer) is usually
very good at producing fast code, although I wouldn't infer from this
one data point that it's always the winner.

If for some reason you want/have to use LispWorks and need fast
integer arithmetic, you can check their "fast 32bit arithmetic":

  <http://www.lispworks.com/documentation/lw445/LWUG/html/lwuser-93.htm>
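
(From memory - do check the names against the chapter above - the API
looks roughly like

  (sys:int32-to-integer
   (sys:int32* (sys:integer-to-int32 x)
               (sys:integer-to-int32 y)))

i.e. you convert at the edges and stay in raw 32-bit operations in
between.)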

Yes, there are certainly more elegant solutions - see here for
example:

  <http://www-jcsu.jesus.cam.ac.uk/~csr21/papers/modular/modular.pdf>

Cheers,
Edi.

-- 

European Common Lisp Meeting 2006: <http://weitz.de/eclm2006/>

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Thomas A. Russ
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <ymilkwkf7yn.fsf@sevak.isi.edu>
Edi Weitz <········@agharta.de> writes:

> On Thu, 09 Feb 2006 06:58:23 +0100, Karsten Wenzel <·······@tfh-berlin.de> wrote:
> 
> > Interesting. When I need something fast in future, I will choose
> > CMUCL ;-)
> 
> Python (the compiler used by CMUCL, SBCL, and Scieneer) is usually
> very good at producing fast code although I wouldn't infer from this
> data point that it's always the winner.

I think it's generally thought that the CMUCL et al. Python compiler is
the clear choice for fast numeric code.

If other performance items, like CLOS, for example, are more critical,
then there are likely to be other choices.  Different implementations
have placed different emphasis on where to try to optimize most heavily.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Förster vom Silberwald
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139483661.205516.130680@f14g2000cwb.googlegroups.com>
Karsten Wenzel wrote:

> > You can also try various OPTIMIZE qualities, including
> > implementation-dependent ones - for LispWorks for example you should
> > check chapter nine of the LispWorks User Guide.
>
> Thanks a lot, I will try this.

I am not going to drag you away from Common Lisp. However, the Bigloo
Scheme compiler would make declaring types much more straightforward
and predictable.

Schneewittchen
From: Stephan Frank
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <dsb352$av9$1@ulysses.news.tiscali.de>
Karsten Wenzel wrote:
> Hi,
> 
> I need to operate on huge two-dimensional arrays.
> For example, a matrix-multiplication.

Hallo Karsten,

apart from the other responses you already got, you might also want to
have a look at Cyrus Harmon's clem library [1]. I have not looked at it
very deeply, but according to the information given by its author it is
designed to provide very high performance for matrix math, making use of
some macrology to take better advantage of the type information.

Regs,
Stephan

[1] http://www.cyrusharmon.org/cl/blog/display/37
From: ·············@specastro.com
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139421963.100485.235370@g43g2000cwa.googlegroups.com>
Karsten,

Fast linear algebra was recently discussed on gmane.lisp.lispworks:
http://thread.gmane.org/gmane.lisp.lispworks.general/5192, especially
my reply:
http://article.gmane.org/gmane.lisp.lispworks.general/5199

Even though it's a LispWorks group, it was about the libraries that are
currently available for doing linear algebra.

The standard papers you should read on this topic are:

"On using Common Lisp in scientific computing", by Nicolas Neuss, the
author of Femlisp,
http://www.iwr.uni-heidelberg.de/organization/sfb359/PP/Preprint2002-40.ps.gz

"Fast Floating-Point Processing in Common Lisp", by Richard Fateman, et
al, http://www.cs.berkeley.edu/~fateman/papers/lispfloat.ps

The bottom line is that there are libraries for doing this; you don't
have to code it up yourself, especially if you need lots of
performance.

Since I do a lot of fast linear algebra in my Common Lisp simulation
work, I've researched this extensively and have formed the following
opinions, which, be mindful, are only my opinions.

For a pure Common Lisp solution (i.e., no dependence on foreign
libraries), your best bet is Femlisp at http://www.femlisp.org/.  Even
though being primarily a finite element method library, it does have a
pretty good basic linear algebra sub-library.  According to my
benchmarks, it's as fast as you'll be able to get with pure Common Lisp
code.  Which, by the way, is pretty fast.

For the absolute fastest linear algebra math you can get, you need to
use CMUCL or SBCL and use MatLisp at http://matlisp.sourceforge.net/.
It is a very extensive CLOS wrapping of the Fortran LAPACK libraries.
The docs say how to use the ATLAS version of the LAPACK libraries to
get even more speed.  In my benchmarks, I got a 5 fold speed increase
in solving a large system of linear equations from using ATLAS instead
of vanilla LAPACK.  I highly recommend MatLisp, as it is the library I
eventually chose and it has a large amount of the functionality of
LAPACK.  I am very happy with it.

Hope all of this helps.

Glenn
From: ·············@specastro.com
Subject: Re: How to do fast matrix-multiplication?
Date: 
Message-ID: <1139432992.686275.164660@z14g2000cwz.googlegroups.com>
Sorry about the multiple postings; Google had a system error when I
posted the reply, and their system must have freaked out and added the
extra copies.