From: Mike G.
Subject: Improved MATLISP
Date: 
Message-ID: <8118f55c-aa5e-4c55-b1da-0032cd0df020@z17g2000hsg.googlegroups.com>
Hi all,

Some time ago, I started a thread here about parallel matrix
operations in Lisp. In that thread it was suggested that I try to
parallelize Matlisp. Being familiar w/ Matlisp, I responded that I
didn't think it would be possible to achieve any performance benefits
without rewriting the Fortran.

I was wrong.

The stock BLAS routine dgemm (which the Matlisp m* uses) doesn't
exploit the cache architecture of modern CPUs. Below I've included
some simple Lisp code which uses Matlisp to perform matrix
multiplication by splitting the matrices into row and column vectors
and forming the dot product. This method utilizes cache more
effectively because at any given time, a row vector can reside in
cache as it is distributed over the column vectors. For large matrices
that cannot both reside in cache at the same time, this is a win.

The mymult-1 function exhibits a 10-12% speed increase over the stock
Matlisp m* / dgemm code. I'm sure this could be further tuned.

The mymult-2 function exhibits a 98% speed increase over m* on my dual-
core by exploiting parallelism. The 10-12% gain from mymult-1 is lost
due to thread creation overhead, but we still get a big win by using
both cores. There is room for improvement here too: some extra code to
restrain thread creation and pack more FLOPs into a single thread
should allow another 15-20% speedup.

If anyone out there is using Matlisp (especially a custom variety
linked against ATLAS or the AMD Core lib) I'd love to work together to
build a Matlisp 3.0. Comments and ideas are welcome.

(defun split-matrix-into-vectors (matrix dir)
  (let* ((rows (1- (number-of-rows matrix)))
	 (cols (1- (number-of-cols matrix))))
    (loop for i from 0 to (if (eq dir :rows)
			      rows
			      cols) collecting
	 (make-real-matrix
	  (loop for j from 0 to (if (eq dir :rows)
				    cols
				    rows) collecting
	       (if (eq dir :rows)
		   (mref matrix i j)
		   (mref matrix j i)))))))

(defun wait-for-threads-to-complete (threads)
  "Waits for all threads to complete before returning NIL"
  (do ((L threads))
      ((null L) nil)
    (if (sb-thread:thread-alive-p (car L))
	(sleep *THREAD-PAUSE-SYNC*)
	(setf L (cdr L)))))
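(Aside: on SBCL the polling loop above can be avoided entirely, since
SB-THREAD:JOIN-THREAD blocks until the given thread terminates. A minimal
sketch, assuming the same SB-THREAD API used above:)

```lisp
;; Sleep-free alternative: JOIN-THREAD blocks until the thread is done,
;; so waiting on a whole list of threads is just a MAPC.
(defun join-all-threads (threads)
  "Block until every thread in THREADS has finished, then return NIL."
  (mapc #'sb-thread:join-thread threads)
  nil)
```

JOIN-THREAD also returns the thread function's value, which would let
the workers hand back results instead of writing into A through shared
state.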

(defun mymult-2 (matrix1 matrix2)
  (let* ((m1 (split-matrix-into-vectors matrix1 :rows))
	 (m2 (split-matrix-into-vectors matrix2 :cols))
	 (rows (number-of-rows matrix1))
	 (cols (number-of-cols matrix2))
	 (A (make-real-matrix rows cols))
	 (threads (make-list 0)))
    (loop for v1 in m1 for i from 0 to (1- rows) do
	 (nconc threads (list
			 (sb-thread:make-thread #'(lambda ()
						    (loop for v2 in m2 for j from 0 to (1- cols) do
							 (setf (matrix-ref A i j) (dot v1 v2)))))))
	 (sleep 0.00001))
    (wait-for-threads-to-complete threads)
    A))

(defun mymult-1 (matrix1 matrix2)
  (let* ((m1 (split-matrix-into-vectors matrix1 :rows))
	 (m2 (split-matrix-into-vectors matrix2 :cols))
	 (rows (number-of-rows matrix1))
	 (cols (number-of-cols matrix2))
	 (A (make-real-matrix rows cols))
	 (threads (make-list 0)))
    (loop for v1 in m1 for i from 0 to (1- rows) do
	 (loop for v2 in m2 for j from 0 to (1- cols) do
	      (setf (matrix-ref A i j) (dot v1 v2))))
    A))

-M

From: ··········@gmail.com
Subject: Re: Improved MATLISP
Date: 
Message-ID: <8aa63bdc-8395-4b57-a83f-45aa351b92e7@s8g2000prg.googlegroups.com>
On Mar 8, 5:12 pm, "Mike G." <···············@gmail.com> wrote:
> Hi all,
>
> Some time ago, I started a thread here about parallel matrix
> operations in Lisp. In that thread it was suggested that I try to
> parallelize Matlisp. Being familiar w/ Matlisp, I responded that I
> didn't think it would be possible to achieve any performance benefits
> without rewriting the Fortran.
>
> I was wrong.
>
> The stock BLAS routine dgemm (which the Matlisp m* uses) doesn't
> exploit the cache architecture of modern CPUs. Below I've included
> some simple Lisp code which uses Matlisp to perform matrix
> multiplication by splitting the matrices into row and column vectors
> and forming the dot product. This method utilizes cache more
> effectively because at any given time, a row vector can reside in
> cache as it is distributed over the column vectors. For large matrices
> that cannot both reside in cache at the same time, this is a win.
>
> The mymult-1 function exhibits a 10-12% speed increase over the stock
> Matlisp m* / dgemm code. I'm sure this could be further tuned.
>
> The mymult-2 function exhibits a 98% speed increase over m* on my dual-
> core by exploiting parallelism. The 10-12% gain from mymult-1 is lost
> due to thread creation overhead, but we still get a big win by using
> both cores. There is room for improvement here too: some extra code to
> restrain thread creation and pack more FLOPs into a single thread
> should allow another 15-20% speedup.
>
> If anyone out there is using Matlisp (especially a custom variety
> linked against ATLAS or the AMD Core lib) I'd love to work together to
> build a Matlisp 3.0. Comments and ideas are welcome.
>
> [code snipped]
>
> -M

This is a great idea, but I believe that even better performance can
be achieved by using a good BLAS library which handles all these
details for you.  The default library that MATLISP links against is
fairly simplistic, and does not do anything cache sensitive, parallel,
etc.  ATLAS tends to be about 2-5 times faster, and is freely
available; however, I had some problems linking MATLISP against it ...
it sounds like you also haven't gotten it to work :)

There is a mailing list for people interested in mathematical stuff in
Lisp; I don't remember it off-hand, and it has been fairly quiet
lately, but that could help you find like-minded people.

Alex
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <53118cbc-dc41-4d3f-a86d-3fa241ab555c@e31g2000hse.googlegroups.com>
On Mar 8, 8:38 pm, ··········@gmail.com wrote:
> On Mar 8, 5:12 pm, "Mike G." <···············@gmail.com> wrote:
>
> > [original post and code snipped]
>
> This is a great idea, but I believe that even better performance can
> be achieved by using a good BLAS library which handles all these
> details for you.  The default library that MATLISP links against is
> fairly simplistic, and does not do anything cache sensitive, parallel,
> etc.  ATLAS tends to be about 2-5 times faster, and is freely
> available; however, I had some problems linking MATLISP against it ...
> it sounds like you also haven't gotten it to work :)
>
> There is a mailing list for people interested in mathematical stuff in
> Lisp; I don't remember it off-hand, and it has been fairly quiet
> lately, but that could help you find like-minded people.
>
> Alex

I did indeed try to link against ATLAS, with no luck. But I don't think
ATLAS handles parallelization, just vectorization. I might be wrong.
I didn't try very hard, mind you: I just tried the configure script
options, and when MATLISP didn't work right (missing alien functions)
I fell back to the old Fortran BLAS.

I spent my evening evolving mymult-3, included below. Cuts run time of
an 800x800 square multiply from 11 seconds (using m*) to 3, by
utilizing cache better and keeping the threads contained.

(defvar *NUM-THREADS* 2)

(defun mymult-3 (matrix1 matrix2)
  (let* ((m1 (split-matrix-into-vectors matrix1 :rows))
	 (m2 (split-matrix-into-vectors matrix2 :cols))
	 (rows (number-of-rows matrix1))
	 (cols (number-of-cols matrix2))
	 (A (make-real-matrix rows cols))
	 (width (ceiling (/ rows *NUM-THREADS*)))
	 (starts (loop for i from 0 to (1- rows) by width collect i))
	 (ends (map 'list #'(lambda (x)
			       (let ((tmp (+ x (1- width))))
				 (if (> tmp rows)
				     rows
				     tmp)))
		     starts))
	 (m3 (map 'list #'(lambda (x y)
			    (subseq m1 x (1+ y)))
		  starts ends))
	 (threads (make-list 0)))
    (loop for v in m3 for k from 0 to rows by 2 do
      (nconc threads
	     (list (sb-thread:make-thread
		    #'(lambda ()
			(loop for v1 in v for i from k to (+ k 1 (length v1)) do
			     (loop for v2 in m2 for j from 0 to (1- cols) do
				  (setf (matrix-ref A i j) (dot v1 v2))))))))
	 (sleep 0.0000001))
    (wait-for-threads-to-complete threads)
    A))
From: Evan Monroig
Subject: Re: Improved MATLISP
Date: 
Message-ID: <87hcffjljf.fsf@obakechan.net>
"Mike G." <···············@gmail.com> writes:

> I indeed try to link against ATLAS, with no luck. But I don't think
> ATLAS handles parallelization, just vectorization. I might be wrong.
> I didn't try very hard, mind you - just tried the configure script
> options, and when MATLISP didn't work right (missing alien functions)
> I used the old Fortran BLAS.
>
> I spent my evening evolving mymult-3, included below. Cuts run time of
> an 800x800 square multiply from 11 seconds (using m*) to 3, by
> utilizing cache better and keeping the threads contained.

Hi,

I don't have much time to spend helping you, but I can run some tests
on my system with the functions that you provided:

(in-package :matlisp)

(defvar *thread-pause-sync* 0.00001)

....[your functions]

(progn
 (sb-profile:profile m*
		     mymult-1
		     mymult-2
		     mymult-3)
 (sb-profile:reset)
 (dotimes (i 10)
   (let ((a (rand 800))
	 (b (rand 800)))
     (m* a b)
     (mymult-1 a b)
     (mymult-2 a b)
     (mymult-3 a b)))
 (sb-profile:report)
 (sb-profile:unprofile))


;;;; (progn  (sb-profile:profile m* 		     mymult-1 		     mymult ...
  seconds  |     consed    | calls |  sec/call  |  name  
-----------------------------------------------------------
    66.200 |   875,946,224 |    10 |   6.619999 | MYMULT-2
    52.405 |   872,307,704 |    10 |   5.240499 | MYMULT-1
    19.579 |   668,410,024 |    10 |   1.957899 | MYMULT-3
    14.157 |    51,200,080 |    10 |   1.415699 | M*
-----------------------------------------------------------
   152.341 | 2,467,864,032 |    40 |            | Total

estimated total profiling overhead: .000 seconds
overhead estimation parameters:
  3.2000003e-8s/call, 1.368e-6s total profiling, 6.08e-7s internal profiling



This is on SBCL 1.0.6 on Linux x86 with a P4 3.2GHz that has 2 cores and
a cache size of 512KB for each processor.  The Lisp optimization
parameters were the defaults, but changing to (optimize (speed 3)
(safety 0)) does not change the results much.

I think the Matlisp that I have is linked against the ATLAS version of
BLAS, but I am not sure.

Hope this helps,

Evan
From: Waldek Hebisch
Subject: Re: Improved MATLISP
Date: 
Message-ID: <fr4rn3$6dk$1@z-news.pwr.wroc.pl>
Mike G. <···············@gmail.com> wrote:
> I indeed try to link against ATLAS, with no luck. But I don't think
> ATLAS handles parallelization, just vectorization. I might be wrong.
> I didn't try very hard, mind you - just tried the configure script
> options, and when MATLISP didn't work right (missing alien functions)
> I used the old Fortran BLAS.
> 
> I spent my evening evolving mymult-3, included below. Cuts run time of
> an 800x800 square multiply from 11 seconds (using m*) to 3, by
> utilizing cache better and keeping the threads contained.
> 

You should really link to ATLAS.  At the moment I can not try Matlisp,
but I can give you timings from Octave: on a 1.86 GHz Intel Core 2,
multiplying 1000x1000 real double precision square random matrices
takes 0.36s.  On a 1.8 GHz Athlon 64, multiplying the same matrices
takes 0.64s.  Naive multiplication for such matrices needs almost
2*10^9 operations (almost 10^9 additions and 10^9 multiplications);
the theoretical maximum for the Athlon 64 is 2 operations per clock,
for the Core 2 the max is 4 operations per clock.  ATLAS gets pretty
close to the theoretical max: on the Athlon 64 we get 1.736 operations
per clock, on the Core 2 2.986 operations per clock.  Such results are
possible only with highly tuned inner loops.
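(The arithmetic above is easy to redo in Lisp; the figures below just
restate the quoted timings, using 2*N^3 operations for a naive NxN
multiply:)

```lisp
;; Operations per clock implied by the timings quoted above:
;; a naive NxN multiply costs about 2*N^3 flops (N^3 adds, N^3 muls).
(defun ops-per-clock (n seconds clock-hz)
  (/ (* 2d0 (expt n 3)) (* seconds clock-hz)))

;; (ops-per-clock 1000 0.36d0 1.86d9) => ~2.99  (Core 2)
;; (ops-per-clock 1000 0.64d0 1.8d9)  => ~1.74  (Athlon 64)
```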

BTW: ATLAS has a C interface, so in principle it should be easier
to use than the Fortran interface (which in ATLAS is emulated).

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl 
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <34d8ea88-93c4-49cc-98ad-e406d66ab796@k13g2000hse.googlegroups.com>
On Mar 10, 10:44 pm, Waldek Hebisch <·······@math.uni.wroc.pl> wrote:
> Mike G. <···············@gmail.com> wrote:
> You should really link to ATLAS.  [snip]

I have ATLAS working w/ Matlisp now. I also have a Matlisp linked w/
ACML. On my dual-core Athlon64 X2 the timings for ATLAS and ACML are
about the same. I believe ACML will win over ATLAS on Barcelona, but I
haven't checked yet because my Barcelona rig is down for upgrades at
the moment.

-M
From: Nicolas Neuss
Subject: Re: Improved MATLISP
Date: 
Message-ID: <87myp5skqy.fsf@ma-patru.mathematik.uni-karlsruhe.de>
Waldek Hebisch <·······@math.uni.wroc.pl> writes:

> Mike G. <···············@gmail.com> wrote:
>> [snip]
>
> You should really link to ATLAS.  [snip]
>
>                               Waldek Hebisch

I second the recommendation to link to ATLAS.  One additional point: the
main part of ATLAS performance stems from performing matrix-matrix
multiplications on subblocks of the matrices which fit into the respective
caches (splitting into rows and columns as the OP did is suboptimal).  This
is in principle possible in Lisp, too.  However, it is tricky to choose the
right block sizes, and therefore it is much more reasonable to use external
libraries (with the additional benefit that you can be reasonably sure
that there are compilers, sometimes provided by the chip vendor, which
optimize the inner loops for a given architecture).  Also, one should note
that the same problem occurs not only in matrix multiplication, but
also in matrix decompositions, etc.
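(To make the blocking idea concrete, here is a minimal, untuned sketch on
plain Lisp arrays; the block size 64 is a placeholder that would have to
be tuned to the target machine's caches, which is exactly what ATLAS
automates:)

```lisp
;; Minimal cache-blocked multiply on ordinary CL arrays, to illustrate
;; the subblock idea.  +BLOCK+ is a placeholder; real block sizes must
;; be tuned to the cache sizes of the target machine.
(defconstant +block+ 64)

(defun blocked-mm (a b)
  (let* ((n (array-dimension a 0))
         (m (array-dimension b 1))
         (k (array-dimension a 1))
         (c (make-array (list n m) :element-type 'double-float
                                   :initial-element 0d0)))
    (loop for ii from 0 below n by +block+ do
      (loop for kk from 0 below k by +block+ do
        (loop for jj from 0 below m by +block+ do
          ;; multiply one pair of subblocks while they are cache-resident
          (loop for i from ii below (min n (+ ii +block+)) do
            (loop for kx from kk below (min k (+ kk +block+)) do
              (let ((aik (aref a i kx)))
                (loop for j from jj below (min m (+ jj +block+)) do
                  (incf (aref c i j) (* aik (aref b kx j))))))))))
    c))
```

The three outer loops walk block indices; only the inner three touch
elements, so each pass reuses a subblock of A and of B while they fit in
cache.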

Nicolas

PS: Concerning Strassen's multiplication, which was mentioned in this
thread: as far as I know, it is difficult to optimize it in the same way
as ordinary matrix multiplication, and it is also numerically not as stable
as ordinary MM with respect to the component-wise error (see Golub-van
Loan, 2.4.10).
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <1598dec7-0d6c-42a4-b972-1b5e045a6d8a@z17g2000hsg.googlegroups.com>
On Mar 8, 8:38 pm, ··········@gmail.com wrote:
> > [snip]
>
> There is a mailing list for people interested in mathematical stuff in
> Lisp; I don't remember it off-hand, and it has been fairly quiet
> lately, but that could help you find like-minded people.


I believe the mailing list to which you refer is "lisp-matrix-devel":

http://groups.google.com/group/lisp-matrix-devel

I tried to reactivate some discussion recently by posting one of my
solutions "SPARSE-MATRIX-TRANSPOSE" from Lisp 3rd edition (Horn and
Winston):

http://groups.google.com/group/lisp-matrix-devel/msg/0aa2558c261459c0

I didn't get the impression that this group is at all active. Mike, I
suspect you're more likely to recruit mathematicians with similar
interests right here at CLL.

agt
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <f754881b-4295-4ee5-b988-f2fe04d1d4bd@c33g2000hsd.googlegroups.com>
On Mar 8, 9:38 pm, ··········@gmail.com wrote:
> This is a great idea, but I believe that even better performance can
> be achieved by using a good BLAS library which handles all these
> details for you.  The default library that MATLISP links against is
> fairly simplistic, and does not do anything cache sensitive, parallel,
> etc.  ATLAS tends to be about 2-5 times faster, and is freely
> available; however, I had some problems linking MATLISP against it ...
> it sounds like you also haven't gotten it to work :)

Hey, where did you get that "2-5 times faster" figure? That's about
what I get with mymult-3 over m*.

Just curious.

FWIW, I looked into ATLAS a bit more - it indeed does offer
parallelized implementations of many of the core algorithms.

-M
From: vanekl
Subject: Re: Improved MATLISP
Date: 
Message-ID: <fqvh7g$9sb$1@aioe.org>
Mike G. wrote:
> ... 
 > Below I've included
> some simple Lisp code which uses Matlisp to perform matrix
> multiplication by splitting the matrices into row and column vectors
> and forming the dot product. This method utilizes cache more
> effectively because at any given time, a row vector can reside in
> cache as it is distributed over the column vectors. For large matrices
> that cannot both reside in cache at the same time, this is a win.
> 
> The mymult-1 function exhibits a 10-12% speed increase over the stock
> Matlisp m* / dgemm code. I'm sure this could be further tuned.
> 
> The mymult-2 function exhibits a 98% speed increase over m* on my dual-
> core by exploiting parallelism. The 10-12% gain from mymult-1 is lost
> due to thread creation overhead, but we still get a big win by using
> both cores. There is room for improvement here too: some extra code to
> restrain thread creation and pack more FLOPs into a single thread
> should allow another 15-20% speedup.

Yeah, I think the rule of thumb is to restrict the number of threads
to the number of cores you have available, or else you spend time doing
useless context switching.


> If anyone out there is using Matlisp (especially a custom variety
> linked against ATLAS or the AMD Core lib) I'd love to work together to
> build a Matlisp 3.0. Comments and ideas are welcome.
 > ...

If you haven't tried Strassen's algorithm, you should consider it.
It can be made to work pretty well with threads, too.
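For reference, a toy Strassen on plain Lisp arrays might look like the
sketch below (power-of-two sizes only; the helper names are made up, and
a real implementation would pad odd sizes and fall back to a plain
multiply below some cutoff):

```lisp
;; Toy Strassen for 2^k x 2^k double-float matrices.  Illustrative only.
(defun m-op (op a b)          ; elementwise + or - of two square matrices
  (let* ((n (array-dimension a 0))
         (c (make-array (list n n) :element-type 'double-float)))
    (dotimes (i n c)
      (dotimes (j n)
        (setf (aref c i j) (funcall op (aref a i j) (aref b i j)))))))

(defun sub-block (a i j n)    ; copy the n x n block at offset (i, j)
  (let ((c (make-array (list n n) :element-type 'double-float)))
    (dotimes (r n c)
      (dotimes (s n)
        (setf (aref c r s) (aref a (+ i r) (+ j s)))))))

(defun strassen (a b)
  (let ((n (array-dimension a 0)))
    (if (= n 1)
        (make-array '(1 1) :element-type 'double-float
                           :initial-element (* (aref a 0 0) (aref b 0 0)))
        (let* ((h (/ n 2))
               (a11 (sub-block a 0 0 h)) (a12 (sub-block a 0 h h))
               (a21 (sub-block a h 0 h)) (a22 (sub-block a h h h))
               (b11 (sub-block b 0 0 h)) (b12 (sub-block b 0 h h))
               (b21 (sub-block b h 0 h)) (b22 (sub-block b h h h))
               ;; the seven Strassen products
               (m1 (strassen (m-op #'+ a11 a22) (m-op #'+ b11 b22)))
               (m2 (strassen (m-op #'+ a21 a22) b11))
               (m3 (strassen a11 (m-op #'- b12 b22)))
               (m4 (strassen a22 (m-op #'- b21 b11)))
               (m5 (strassen (m-op #'+ a11 a12) b22))
               (m6 (strassen (m-op #'- a21 a11) (m-op #'+ b11 b12)))
               (m7 (strassen (m-op #'- a12 a22) (m-op #'+ b21 b22)))
               (c (make-array (list n n) :element-type 'double-float)))
          (dotimes (i h c)        ; assemble the four result quadrants
            (dotimes (j h)
              (setf (aref c i j)                   ; C11 = M1+M4-M5+M7
                    (+ (aref m1 i j) (aref m4 i j)
                       (- (aref m7 i j) (aref m5 i j)))
                    (aref c i (+ j h))             ; C12 = M3+M5
                    (+ (aref m3 i j) (aref m5 i j))
                    (aref c (+ i h) j)             ; C21 = M2+M4
                    (+ (aref m2 i j) (aref m4 i j))
                    (aref c (+ i h) (+ j h))       ; C22 = M1-M2+M3+M6
                    (+ (aref m1 i j) (aref m3 i j)
                       (- (aref m6 i j) (aref m2 i j))))))))))
```

Strassen trades one multiplication for extra additions at each level, so
it only pays off for fairly large matrices, and with somewhat weaker
componentwise error bounds (per Golub and van Loan).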
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <172250ac-1bf0-42b5-8975-1d3b0b61ece5@n58g2000hsf.googlegroups.com>
On Mar 8, 10:15 pm, vanekl <·····@acd.net> wrote:
> Mike G. wrote:
> > [snip]
>
> Yeah, I think the rule of thumb is to restrict the number of threads
> to the number of cores you have available, or else spend time doing
> useless context switching.

A very loose rule of thumb, sure. I usually like to allow double the
number of threads per core: some threads might complete faster than
others, and during that remaining time the freed core has no work to
do. A little time spent context switching among a small number of
threads can help in that situation.

>
> > [snip]
>
> If you haven't tried Strassen's algorithm, you should consider it.
> It can be made to work pretty well with threads, too.

Yeah, this occurred to me, but it isn't clear to me how to take
advantage of the cache hierarchy as well as the dot-product-of-vectors
approach does. I'm sure it could be done, though.
From: Juho Snellman
Subject: Re: Improved MATLISP
Date: 
Message-ID: <87zlt75ign.fsf@vasara.proghammer.com>
"Mike G." <···············@gmail.com> writes:
> (defun mymult-2 (matrix1 matrix2)
>   (let* ((m1 (split-matrix-into-vectors matrix1 :rows))
> 	 (m2 (split-matrix-into-vectors matrix2 :cols))
> 	 (rows (number-of-rows matrix1))
> 	 (cols (number-of-cols matrix2))
> 	 (A (make-real-matrix rows cols))
> 	 (threads (make-list 0)))
>     (loop for v1 in m1 for i from 0 to (1- rows) do
> 	 (nconc threads (list
> 			 (sb-thread:make-thread #'(lambda ()
> 						    (loop for v2 in m2 for j from 0 to (1- cols) do
> 							 (setf (matrix-ref A i j) (dot v1 v2)))))))
> 	 (sleep 0.00001))
>     (wait-for-threads-to-complete threads)
>     A))

This code is horribly buggy. All threads created share the same
bindings for V1 and I. Thus it's possible that a thread you created
during an iteration where V1 had a certain value, will suddenly start
using a different value for V1 at some point of its execution. Here's
a smaller example showing the bug, and the way to properly fix it by
adding a binding that will not be shared between threads:

  * (loop for v1 in '(1 2 3 4 5)  
          do (let ((v v1))
               (sb-thread:make-thread
                 (lambda ()
                   (print (list v v1))))))

  (1 1) 
  (2 3) 
  (3 4) 
  (4 5) 
  (5 5) 

It looks like you've maybe noticed the bug yourself, and then tried to
work around it by inserting a SLEEP after each thread creation. That's
a bad idea. Any time you find yourself inserting a SLEEP to work
around a bug, you should stop, and find the root cause of the problem.

-- 
Juho Snellman
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <1da8d005-a4a9-47b0-a6a1-6f4cb299dd68@e39g2000hsf.googlegroups.com>
On Mar 10, 12:08 am, Juho Snellman <······@iki.fi> wrote:

> This code is horribly buggy. All threads created share the same
> bindings for V1 and I. Thus it's possible that a thread you created
> during an iteration where V1 had a certain value, will suddenly start
> using a different value for V1 at some point of its execution. Here's
> a smaller example showing the bug, and the way to properly fix it by
> adding a binding that will not be shared between threads:

You're right. I did notice the bug, and that is what the SLEEP
work-around was for. At the time, the kludge seemed necessary for
expedience (I was more interested in trying to get a handle on my
target machine's sweet spots).

What do you think of something like this:

(defun mymult-2a (matrix1 matrix2)
  (let* ((m1 (split-matrix-into-vectors matrix1 :rows))
	 (m2 (split-matrix-into-vectors matrix2 :cols))
	 (rows (number-of-rows matrix1))
	 (cols (number-of-cols matrix2))
	 (A (make-real-matrix rows cols))
	 (threads (map 'list #'(lambda (v1 i)
				 (sb-thread:make-thread
				  #'(lambda ()
				      (loop for v2 in m2 for j from 0 to (1- cols) do
					   (setf (matrix-ref A i j) (dot v1 v2))))))
		       m1
		       (loop for i from 0 to (1- rows) collecting i))))
    (wait-for-threads-to-complete threads)
    A))

-M
From: Raymond Toy (RT/EUS)
Subject: Re: Improved MATLISP
Date: 
Message-ID: <sxdwsoaoemg.fsf@rtp.ericsson.se>
>>>>> "Mike" == Mike G <···············@gmail.com> writes:

    Mike> If anyone out there is using Matlisp (especially a custom variety
    Mike> linked against ATLAS or the AMD Core lib) I'd love to work together to
    Mike> build a Matlisp 3.0. Comments and ideas are welcome.

I have used ATLAS with matlisp.  It's supposed to work if you
configure matlisp to use atlas, but I haven't done that in quite a
while.  

But I would be happy to work with you to improve Matlisp.  The matlisp
mailing lists might be more appropriate for that, though.

Ray
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <c01ef269-20f6-4925-a93b-a3f7e4a167ec@i29g2000prf.googlegroups.com>
On Mar 10, 9:08 am, ···········@ericsson.com (Raymond Toy (RT/EUS))
wrote:
> >>>>> "Mike" == Mike G <···············@gmail.com> writes:
>

> But I would be happy to work with you to improve Matlisp.  The matlisp
> mailing lists might be more appropriate for that, though.
>
> Ray

Where might I find a comparison of Maxima and Matlisp?

I am about to experiment with Maxima.

agt
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <4b84d4cd-61ee-4405-8b61-658f65a3ed5b@e60g2000hsh.googlegroups.com>
On Mar 11, 3:54 pm, viper-2 <········@mail.infochan.com> wrote:
> On Mar 10, 9:08 am, ···········@ericsson.com (Raymond Toy (RT/EUS))
> wrote:
>
> > >>>>> "Mike" == Mike G <···············@gmail.com> writes:
>
> > But I would be happy to work with you to improve Matlisp.  The matlisp
> > mailing lists might be more appropriate for that, though.
>
> > Ray
>
> Where might I find a comparison of Maxima and Matlisp?
>
> I am about to experiment with Maxima.
>
> agt

I doubt an explicit write-up like this exists, since it would be
comparing apples to oranges.

Matlisp is a set of bindings to LAPACK/BLAS. It is intended for
numerical computations.

Maxima is a full-on symbolic mathematics language. It has numerical
abilities too - but it is much slower than Matlisp in this regard. On
the other hand, you can define matrices with variables as elements,
for instance, and perform operations on them. You can't do that with
Matlisp.

-M
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <781c44f6-eca5-42d6-bab0-0a218be8de57@e25g2000prg.googlegroups.com>
On Mar 13, 8:01 am, "Mike G." <···············@gmail.com> wrote:
> On Mar 11, 3:54 pm, viper-2 <········@mail.infochan.com> wrote:
>

> > Where might I find a comparison of Maxima and Matlisp?
>
> > I am about to experiment with Maxima.
>
> > agt
>
> I doubt an explicit write-up like this exists, since it would be
> comparing apples to oranges.
>

No, I couldn't find a comparison anywhere on the web.

> Matlisp is a set of bindings to LAPACK/BLAS. It is intended for
> numerical computations.
>
> Maxima is a full-on symbolic mathematics language. It has numerical
> abilities too - but is much slower than Matlisp in this regard. But,
> you can define matrices with variables as elements, for instance and
> perform operations on them. You can't do that with Matlisp.


Of course, the only thing to do is to get my fingers wet with
both. :-) Thanks,

agt
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <b29db7f3-db15-42f2-8824-0ce0a6d60e30@n77g2000hse.googlegroups.com>
On Mar 13, 9:31 am, viper-2 <········@mail.infochan.com> wrote:
> On Mar 13, 8:01 am, "Mike G." <···············@gmail.com> wrote:
>
> > On Mar 11, 3:54 pm, viper-2 <········@mail.infochan.com> wrote:
>
> > > Where might I find a comparison of Maxima and Matlisp?
>
> > > I am about to experiment with Maxima.
>
> > > agt
>
> > I doubt an explicit write-up like this exists, since it would be
> > comparing apples to oranges.
>
> No, I couldn't find a comparison anywhere on the web.
>
> > Matlisp is a set of bindings to LAPACK/BLAS. It is intended for
> > numerical computations.
>
> > Maxima is a full-on symbolic mathematics language. It has numerical
> > abilities too - but is much slower than Matlisp in this regard. But,
> > you can define matrices with variables as elements, for instance and
> > perform operations on them. You can't do that with Matlisp.
>
> Of course, the only thing to do is to get my fingers wet with
> both. :-) Thanks,
>
> agt

Yeah. FWIW, I really like Maxima. It was largely Maxima that brought
me to Lisp, in fact. I had used Emacs Lisp here and there for years,
of course, and I had read Eric Raymond's quote about the "profound
enlightenment experience" one gets from Lisp. I just woke up one day
and said to myself "I have to check this Lisp thing out.." and I never
really looked back. Of course, I have the advantage that I program
largely as a hobby, and so the unavailability of certain libs is a
feature for me, as it gives me opportunities for projects. Can't wait
to have the free time to pursue them.

-M
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <05d93a9a-bd22-4f12-806b-f277e0c61ebe@s8g2000prg.googlegroups.com>
On Mar 13, 10:25 am, "Mike G." <···············@gmail.com> wrote:
> >
> Yeah. FWIW, I really like Maxima. It was largely Maxima that brought
> me to Lisp.

I had kind of decided to get started with Maxima first; it's just that
there seemed to be much interest in optimizing Matlisp.

Good luck with the projects. I have so much programming, math (etc.
etc. ...) to catch up on myself that I get tired just thinking about
it. :-)

agt
From: John Thingstad
Subject: Re: Improved MATLISP
Date: 
Message-ID: <op.t7y3jlk0ut4oq5@pandora.alfanett.no>
On Thu, 13 Mar 2008 14:01:33 +0100, Mike G.
<···············@gmail.com> wrote:

> On Mar 11, 3:54 pm, viper-2 <········@mail.infochan.com> wrote:
>> On Mar 10, 9:08 am, ···········@ericsson.com (Raymond Toy (RT/EUS))
>> wrote:
>>
>> > >>>>> "Mike" == Mike G <···············@gmail.com> writes:
>>
>> > But I would be happy to work with you to improve Matlisp.  The matlisp
>> > mailing lists might be more appropriate for that, though.
>>
>> > Ray
>>
>> Where might I find a comparison of Maxima and Matlisp?
>>
>> I am about to experiment with Maxima.
>>
>> agt
>
> I doubt an explicit write-up like this exists, since it would be
> comparing apples to oranges.
>
> Matlisp is a set of bindings to LAPACK/BLAS. It is intended for
> numerical computations.
>
> Maxima is a full-on symbolic mathematics language. It has numerical
> abilities too - but is much slower than Matlisp in this regard. But,
> you can define matrices with variables as elements, for instance and
> perform operations on them. You can't do that with Matlisp.
>
> -M

I was experimenting with a .NET interface to Mathematica. My original idea  
was to use the C interface to MathLink, but this proved exceedingly  
complicated. I might add that Mathematica also uses LAPACK/BLAS under the  
hood for numerical computations and the GNU Multiple Precision (GMP)  
library, as well as providing symbolic computation abilities. (As a  
curiosity, you can find several references to Stephen Wolfram in the  
Maxima source code, as he worked on this in the early 80s before writing  
Mathematica.)

--------------
John Thingstad
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <8d451a4c-31cb-4054-b824-4e9e48b257e6@e10g2000prf.googlegroups.com>
On Mar 13, 4:05 pm, "John Thingstad" <·······@online.no> wrote:

> (As a curiosity you can find
> several references to Stephen Wolfram in the Maxima source code as he
> worked on this in the early 80's before writing Mathematica.)

I knew that Mathematica was influenced by the original Macsyma (from
which Maxima is derived), but I didn't realise that Wolfram was
involved with the development of Macsyma or Maxima.

agt
From: Mike G.
Subject: Re: Improved MATLISP
Date: 
Message-ID: <166bb9e2-0bcb-4877-a7ad-731da48e75de@v3g2000hsc.googlegroups.com>
On Mar 13, 7:12 pm, viper-2 <········@mail.infochan.com> wrote:
> On Mar 13, 4:05 pm, "John Thingstad" <·······@online.no> wrote:
>
> > (As a curiosity you can find
> > several references to Stephen Wolfram in the Maxima source code as he
> > worked on this in the early 80's before writing Mathematica.)
>
> I knew that Mathematica was influenced by the original Macsyma (from
> which Maxima is derived), but I didn't realise that Wolfram was
> involved with the development of  Macsyma or Maxima.
>
> agt

I didn't realize this either. I do love trivia ;)

I really liked using Mathematica in university. I read Wolfram's book,
A New Kind of Science, and I really wish I had a copy of Mathematica
when I read it, to play with some of the ideas and code more. I did
reimplement some of it in DrScheme, but mostly contented myself with
reading it during my train ride commutes.

-M
From: viper-2
Subject: Re: Improved MATLISP
Date: 
Message-ID: <e3b46b0d-7d9e-4fb3-bedc-d0860683e590@d4g2000prg.googlegroups.com>
On Mar 13, 11:06 pm, "Mike G." <···············@gmail.com> wrote:
> On Mar 13, 7:12 pm, viper-2

> I didn't realize this either. I do love trivia ;)
>
> I really liked using Mathematica in university. I read Wolfram's book
> - A New Kind of Science - I really wish I had a copy of Mathematica
> when I read it, to play with some of the ideas and code more. I did re-
> implement some of it in DrScheme, but mostly contented myself with
> reading it during my train ride commutes.
>
> -M

Yes, Scheme is also on my projects list. :-)

agt