Optimizing general array functions?

From: Johann Hibschman
Subject: Optimizing general array functions?
Date: Wed, 01 Oct 1997 00:00:00 +0000
Message-ID: <60s76u$pr4@agate.berkeley.edu>

I've been trying to figure out the best way to write a function
which will take an arbitrary number of 1D arrays of double-float's
and sum them, like (a+ a1 a2 a3).

It's easy to write a function that does this, but I can't quite
figure out how to optimize it.  (I'm using CMUCL, btw.)

Does anyone have any suggestions?  I've tried implementing the
routine as follows:

;;; Test different methods of applying numerical operators to
;;; multiple vectors.

(declaim (optimize (speed 3) (safety 0)))

(defun a+ (&rest arrays)
  (let* ((n   (apply #'min (mapcar #'length arrays)))
	 (out (make-array n :element-type 'double-float)))
    (declare (fixnum n))
    (dotimes (i n)
      (declare (fixnum i))
      (setf (aref out i) (apply #'+ (mapcar #'(lambda (a) (aref a i))
					    arrays))))
    out))

(defun a+2 (&rest arrays)
  (let* ((n   (reduce #'min arrays :key #'length))
	 (out (make-array n :element-type 'double-float)))
    (declare (fixnum n))
    (dotimes (i n)
      (declare (fixnum i))
      (setf (aref out i)
	    (reduce #'+ arrays :key (lambda (a) (aref a i)))))
    out))

(defun a+3 (&rest arrays)
  (let* ((n   (apply #'min (mapcar #'length arrays)))
	 (out (make-array n :element-type 'double-float)))
    (declare (fixnum n))
    (apply #'map-into out #'+ arrays)))


(defun test (n-arrays n-elements n-times)
  (declare (fixnum n-arrays n-elements n-times))
  (let ((alist nil))
    (dotimes (i n-arrays)
      (declare (fixnum i))
      (push (make-array n-elements :element-type 'double-float
			:initial-element (* 1d0 i))
	    alist))
    (time
     (dotimes (i n-times)
       (declare (fixnum i))
       (apply #'a+ alist)))
    (time
     (dotimes (i n-times)
       (declare (fixnum i))
       (apply #'a+2 alist)))
    (time
     (dotimes (i n-times)
       (declare (fixnum i))
       (apply #'a+3 alist)))))


-- 
Johann A. Hibschman         | Grad student in Physics, working in Astronomy.
······@physics.berkeley.edu | Probing pulsar pair production processes.

Re: Optimizing general array functions? Bernhard Pfahringer
- Re: Optimizing general array functions? Johann Hibschman
  - Re: Optimizing general array functions? Bernhard Pfahringer
Re: Optimizing general array functions? Raymond Toy
Re: Optimizing general array functions? Kelly Murray

From: Bernhard Pfahringer
Subject: Re: Optimizing general array functions?
Date: Wed, 01 Oct 1997 00:00:00 +0000
Message-ID: <60tfjm$ia2$1@ftp.univie.ac.at>

In article <··········@agate.berkeley.edu>,
Johann Hibschman <······@physics12.Berkeley.EDU> wrote:
>I've been trying to figure out the best way to write a function
>which will take an arbitrary number of 1D arrays of double-float's
>and sum them, like (a+ a1 a2 a3).
>
>
>(declaim (optimize (speed 3) (safety 0)))
>
>(defun a+ (&rest arrays)
>  (let* ((n   (apply #'min (mapcar #'length arrays)))
>	 (out (make-array n :element-type 'double-float)))
>    (declare (fixnum n))
>    (dotimes (i n)
>      (declare (fixnum i))
>      (setf (aref out i) (apply #'+ (mapcar #'(lambda (a) (aref a i))
>					    arrays))))
>    out))
>

Hows about the following (plus more declarations):

(defun a+ (&rest arrays)
  (loop with n = (loop for a in arrays minimize (length a))
	with out = (make-array n :element-type 'double-float)
	for i below n do
	(setf (aref out i) (loop for a in arrays sum (aref a i)))
	finally
	(return out)))

But I guess that especially for larger and/or more vectors
the following should be more efficient due to locality of access
reasons:

(defun a+ (&rest arrays)
  (loop with n = (loop for a in arrays minimize (length a))
	with out = (make-array n
			       :element-type 'double-float
			       :initial-element 0.0d0)
	for a in arrays do
	(loop for i below n do
	      (incf (aref out i) (aref a i)))
	finally
	(return out)))

cheers, Bernhard
-- 
--------------------------------------------------------------------------
Bernhard Pfahringer
Austrian Research Institute for  http://www.ai.univie.ac.at/~bernhard/
Artificial Intelligence          ········@ai.univie.ac.at

From: Johann Hibschman
Subject: Re: Optimizing general array functions?
Date: Wed, 01 Oct 1997 00:00:00 +0000
Message-ID: <60uc9m$99@agate.berkeley.edu>

In article <············@ftp.univie.ac.at>,
Bernhard Pfahringer <········@korb.ai.univie.ac.at> wrote:

>But I guess that especially for larger and/or more vectors
>the following should be more efficient due to locality of access
>reasons:

I've been trying to understand these "locality of access issues," so
if you could elaborate a bit more, I'd appreciate it. 

I know it's better to access many data elements close to each other in
memory, due to caches, etc.  However, I see the C++ template numerics
people saying that their method is great because they implement (for
vectors a, b, c, d)

a = b + c + d;

as

for(i=0; i<N; i++)
  a[i] = b[i] + c[i] + d[i];

rather than as lots of pairwise interactions a = b + (c + d).

double temp[N];
for(i=0; i<N; i++) temp1[i] = c[i]+d[i];
for(i=0; i<N; i++) a[i] = b[i]+temp[i];

If the former is better, it's either because the cost of generating the
temporary, or because it's somehow better to only have to access each
element of 'a' once.

One 'easy method' in Lisp seems to be in between the two, in C terms:

for(i=0; i<N; i++) a[i] =  d[i];
for(i=0; i<N; i++) a[i] += c[i]
for(i=0; i<N; i++) a[i] += d[i]

which avoids the temporary, and only involves two arrays in each loop,
but which contains the looping overhead multiple times.

If temporary allocation is the problem, this is fine.  If accessing 'a'
or the looping overhead is the problem, then this is bad.  What
mental picture do I need to decide between the two?

>(defun a+ (&rest arrays)
>  (loop with n = (loop for a in arrays minimize (length a))
>	with out = (make-array n
>			       :element-type 'double-float
>			       :initial-element 0.0d0)
>	for a in arrays do
>	(loop for i below n do
>	      (incf (aref out i) (aref a i)))
>	finally
>	(return out)))

I haven't yet figured out the loop macro.  How would you insert the
appropriate declarations into this?

- Johann

-- 
Johann A. Hibschman         | Grad student in Physics, working in Astronomy.
······@physics.berkeley.edu | Probing pulsar pair production processes.

From: Bernhard Pfahringer
Subject: Re: Optimizing general array functions?
Date: Thu, 02 Oct 1997 00:00:00 +0000
Message-ID: <60vqtc$13fo$1@ftp.univie.ac.at>

In article <·········@agate.berkeley.edu>,
Johann Hibschman <······@physics12.Berkeley.EDU> wrote:
>In article <············@ftp.univie.ac.at>,
>
>I've been trying to understand these "locality of access issues," so
>if you could elaborate a bit more, I'd appreciate it. 
>
>I know it's better to access many data elements close to each other in
>memory, due to caches, etc.  However, I see the C++ template numerics
>people saying that their method is great because they implement (for
>vectors a, b, c, d)
>
>a = b + c + d;
>
>as
>
>for(i=0; i<N; i++)
>  a[i] = b[i] + c[i] + d[i];
>

For 3 vectors this is the better solution I guess, especially
as the temp-sum can be kept in a register. The problem with
this kind of solution starts when the number of vectors is
so large that accessing zzz[i] causes the prefetched range of
values b[i]-b[i+SOME-N] to drop out of the cache.

So for a small number of vectors the above solution rules,
whereas for a larger number the alternative is better.
The exact breakeven point depends on the specifics of your cache.

And if your vectors are large enough as to not fit into
main memory all at once (i.e. paging occurs), then
I guess the alternative is better, too. But you never know
without benchmarking.

>
>I haven't yet figured out the loop macro.  How would you insert the
>appropriate declarations into this?
>
Easy-peasy, I should have included some, but I was lazy,
the following compiles with no notes at all:

(defun a+ (&rest arrays)
  (loop with n = (loop for a of-type (simple-array double-float (*))
		       in arrays minimize (length a))
	with out = (make-array (the integer n)
			       :element-type 'double-float
			       :initial-element 0.0d0)
	for a of-type (simple-array double-float (*)) in arrays do
	(loop for i below n do
	      (incf (aref out i) (aref a i)))
	finally
	(return out)))

cheers, Bernhard

-- 
--------------------------------------------------------------------------
Bernhard Pfahringer
Austrian Research Institute for  http://www.ai.univie.ac.at/~bernhard/
Artificial Intelligence          ········@ai.univie.ac.at

From: Raymond Toy
Subject: Re: Optimizing general array functions?
Date: Wed, 01 Oct 1997 00:00:00 +0000
Message-ID: <4nyb4d8wvm.fsf@rtp.ericsson.se>

>>>>> "Johann" == Johann Hibschman <······@physics12.Berkeley.EDU> writes:

    Johann> I've been trying to figure out the best way to write a function
    Johann> which will take an arbitrary number of 1D arrays of double-float's
    Johann> and sum them, like (a+ a1 a2 a3).

    Johann> It's easy to write a function that does this, but I can't quite
    Johann> figure out how to optimize it.  (I'm using CMUCL, btw.)

    Johann> Does anyone have any suggestions?  I've tried implementing the
    Johann> routine as follows:

[code deleted]

How about this?  I get just one compiler note for make-array.  I
think this does what you want:

(defun a+ (&rest arrays)
  (declare (optimize (speed 3)))
  (let* ((n   (reduce #'min arrays :key #'length))
	 (out (make-array n :element-type 'double-float :initial-element 0d0)))
    (declare (fixnum n))
    (dolist (array arrays out)
      (declare (type (simple-array double-float (*)) array))
      (dotimes (k n)
	(incf (aref out k) (aref array k))))))

I assume each array is a simple-array of double-float's.

Ray

From: Kelly Murray
Subject: Re: Optimizing general array functions?
Date: Thu, 02 Oct 1997 00:00:00 +0000
Message-ID: <60vame$iaf$1@vapor.franz.com>

In article <··········@agate.berkeley.edu>, ······@physics12.Berkeley.EDU (Johann Hibschman) writes:
>> I've been trying to figure out the best way to write a function
>> which will take an arbitrary number of 1D arrays of double-float's
>> and sum them, like (a+ a1 a2 a3).
>> 
>> It's easy to write a function that does this, but I can't quite
>> figure out how to optimize it.  (I'm using CMUCL, btw.)
>> 
>> (declaim (optimize (speed 3) (safety 0)))
>> 
>> (defun a+ (&rest arrays)
>>   (let* ((n   (apply #'min (mapcar #'length arrays)))
>> 	 (out (make-array n :element-type 'double-float)))
>>     (declare (fixnum n))
>>     (dotimes (i n)
>>       (declare (fixnum i))
>>       (setf (aref out i) (apply #'+ (mapcar #'(lambda (a) (aref a i))
>> 					    arrays))))
>>     out))

The real key to speeding this stuff up is using declarations for
the arrays so the compiler can opencode the float operations
without consing up a bunch of temporary float objects.
This likely means you can't #'apply or #'function the operator,
but it must be called directly in the source on the declared arg types.

This leads to writing a macro instead of a function, that can expand
into the fast code.  It may or maynot crimp your style, since as a macro,
it can't be passed as a function, but you can define a "slow" version
as above to handle the general case. 
the "op" could be a closure, and not a simple arithmetic symbol,
but that would take a small change to the macro.
The macros handles all the variable args stuff at "compile" time.

Here is an implementation.  It is "hairy", but it will produce
highly optimal code (at least under ACL)

Hope this is helpful.

-Kelly Murray  ···@franz.com

(defmacro float-vector-operation (op &rest vectors)
  (let* ((vbinds (mapcar #'(lambda (v)`(,(gensym) ,v)) vectors))
         (vvars (mapcar #'car vbinds))
         )
  `(let* (
          ,@vbinds  ;; each vector arg is evaluated, and bound to a local
         ;; now find the minimum length
         (n (min ,@vvars))
         ;; make the result vector
 	 (out (make-array n :element-type 'double-float))
 	 )
     (declare (fixnum n) (type (simple-array double-float (*)) out)
              (type (simple-array double-float (*)) ,@vvars)
              (optimize (speed 3) (space 0) (safety 0)) ;; yee-ha!
     )
     (dotimes (i n)
       (declare (fixnum i))
       (setf (aref out i)
             (,op ,@(mapcar #'(lambda (v)`(aref ,v i)) vvars))))
     out))))

Here is an example of how it would be called:

 (let* ((v1+v2 (float-vector-operation + v1 v2))) 
    ...)

and what it expands into:
 
 (let* ((v1+v2 
         (let* ((#:gensym1 v1)
                (#:gensym2 v2)
                (n (min #:gensym1 #:gensym2))
                (out (make-array n :element-type 'double-float))
                )
           (declare (fixnum n) (type (simple-array double-float (*)) out)
                    (type (simple-array double-float (*)) #:gensym1 #:gensym2)
                    (optimize (speed 3) (space 0) (safety 0))
           (dotimes (i n)
             (declare (fixnum i))
             (setf (aref out i) (+ (aref #:gensym1 i) (aref #:gensym2 i))))
           out)))
  ..)