From: Shiv
Subject: Faster array access in loops
Message-ID: <lcso1d9uq5.fsf@balrog.ece.ucsb.edu>
Recently I have been writing some number crunching code and have
observed (as is well-known) that the Lisp code runs a factor of 2 to 3
times slower than comparable Fortran/C code (at least on my Sparc20).
For example:

(defun inner-product (a b)
  (declare (type (simple-array single-float (*)) a b)
	   (optimize (speed 3) (safety 0) (space 0) (debug 0)))
  (let ((n (array-dimension a 0))
	(acc 0.0))
    (declare (fixnum n)
	     (single-float acc))
    (dotimes (i n acc)
       (declare (fixnum i))
       (incf acc (* (aref a i) (aref b i))))))

will be within a factor of two of equivalent Fortran/C code.
Disassembling the compiled code seems to indicate that the slow-down
is mainly due to index-arithmetic for the array accesses.

It seems to me that this situation can be improved if the programmer
is given access to pointers that point to a specific array element.
(I followed the recent thread on locatives and this did not seem to be
mentioned.)  For example if I could write the above function as:

(defun inner-product (a b)
  (declare (type (simple-array single-float (*)) a b)
	   (optimize (speed 3) (safety 0) (space 0) (debug 0)))
  (let ((n (array-dimension a 0))
	(i 0)
	(acc 0.0)
	(a-locative (array-locative a 0))
	(b-locative (array-locative b 0)))
    (declare (fixnum n i)
	     (single-float acc)
	     (type array-locative a-locative b-locative))
    (loop
	(incf acc (* (aref-using-locative a a-locative)
		     (aref-using-locative b b-locative)))
	(incf i 1)
	(when (>= i n) (return acc))
	(incf-array-locative a-locative 1 a)
	(incf-array-locative b-locative 1 b))))

where

(array-locative array index) returns a pointer to (aref array index),
(aref-using-locative array array-locative) returns the element of the
array that the array-locative is pointing to,
(incf-array-locative array-locative increment array) destructively
adds a suitable number to array-locative,

then it seems to me that for the right optimization settings the
compiler could replace the (incf-array-locative ...) with an
immediate-add instruction, and the (aref-using-locative ...) form with
a simple load instruction.  For more conservative optimization
settings the calls can be made safe since the associated array is
always passed in.
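[In C terms, the proposed locatives amount to ordinary pointer
arithmetic.  A sketch of the code the locative version hopes the
compiler will generate (the function name here is illustrative, not
anything from the thread):]

```c
#include <stddef.h>

/* What the locative version is trying to express: walk two raw
   pointers across the arrays instead of recomputing a[i] and b[i]
   from a base register and an index on every iteration. */
float inner_product(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    const float *ap  = a;        /* "array-locative" for a */
    const float *bp  = b;        /* "array-locative" for b */
    const float *end = a + n;

    while (ap < end) {
        acc += *ap * *bp;        /* aref-using-locative: plain loads */
        ap++;                    /* incf-array-locative: immediate add */
        bp++;
    }
    return acc;
}
```

[Each access is then a single load, and each step a single immediate
add, which is exactly the instruction sequence gcc emits for the C
version.]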

My question is: Are such operations easy (for compiler writers) to
provide (if at all)?  By 'easy', I mean: can existing compilers be
easily extended to accommodate this?

--shiv--

PS: If GC-issues are not a problem, it seems the difficulty-level
should be on par with treating floats: that is, boxing and UN-boxing
the array-locatives.

From: ········@cc.hut.fi
Subject: Re: Faster array access in loops
Message-ID: <m3wvqphxei.fsf@mu.tky.hut.fi>
Shiv <····@balrog.ece.ucsb.edu> writes:

> (defun inner-product (a b)

This is beside the point, but for things like the inner product it may
be worthwhile to use vendor math libraries, which should be in
hand-tuned assembly code.  For SPARC, Sun's Workshop includes the Sun
Performance Library, which contains optimized versions of the BLAS and
LINPACK functions among others, and they work fine with CMUCL.
Of course, if your vectors are short, doing a foreign function call may
be too much overhead.

Hannu Rummukainen
From: Tim Bradshaw
Subject: Re: Faster array access in loops
Message-ID: <ey3puwgdsed.fsf@cley.com>
[I sent this in private mail, slightly by mistake, as it would be
interesting if someone who understands sparc better than me (Duane?)
could comment!]

* Shiv  wrote:
> Recently I have been writing some number crunching code and have
> observed (as is well-known) that the Lisp code runs a factor of 2 to 3
> times slower than comparable Fortran/C code (at least on my Sparc20).
> For example:

> (defun inner-product (a b)
>   (declare (type (simple-array single-float (*)) a b)
> 	   (optimize (speed 3) (safety 0) (space 0) (debug 0)))
>   (let ((n (array-dimension a 0))
> 	(acc 0.0))
>     (declare (fixnum n)
> 	     (single-float acc))
>     (dotimes (i n acc)
>        (declare (fixnum i))
>        (incf acc (* (aref a i) (aref b i))))))

> will be within a factor of two of equivalent Fortran/C code.
> Disassembling the compiled code seems to indicate that the slow-down
> is mainly due to index-arithmetic for the array accesses.

What Lisp and/or C compiler are you using?  For my tests -- CMUCL
and Allegro on a SPARC, versus gcc -- I get a good deal better than a
factor of 2 for code like this, more like 1.2 or so.  I'm not fluent
enough in SPARC assembler to understand it in detail, but the Lisp and
C outputs I get are:

gcc (2.95) -O2, inner product bit:

    .LL11:
	    ld	[%g3+%o0], %f3
	    addcc	%g2, -1, %g2
	    ld	[%g3+%o1], %f2
	    fmuls	%f3, %f2, %f3
	    add	%g3, 4, %g3
	    bne	.LL11
	    fadds	%f0, %f3, %f0


Allegro:

      76: f9022002     ld	[%o0 + 2], %f28
      80: 9006000a     add	%i0, %o2, %o0
      84: f5022002     ld	[%o0 + 2], %f26
      88: b9a7093a     fmuls	%f28, %f26, %f28
      92: bda7883c     fadds	%f30, %f28, %f30
      96: 9402a004     add	%o2, #x4, %o2
    lb4:
     100: 80a28009     cmp	%o2, %o1
     104: 26bffff9     bl,a	76

CMUCL:

    5B4: L1:   add   %nl1, 1, %cfunc
    5B8:       ldf   [%a0+%cfunc], %f1
    5BC:       add   %nl1, 1, %cfunc
    5C0:       ldf   [%a1+%cfunc], %f2
    5C4:       fmuls %f1, %f1, %f2
    5C8:       fadds %f0, %f0, %f1
    5CC:       add   4, %nl1
    5D0: L2:   cmp   %nl1, %a3
    5D4:       blt   L1

CMUCL seems to be significantly worse than the other two -- that
second add at the start can't be needed, can it?  gcc seems to be doing
slightly better loop stuff -- I *presume* the addcc is what the bne is
checking.  gcc is also getting the instructions in non-obvious orders,
which might be better, though I had assumed that modern CPUs do
this kind of instruction reordering themselves?

--tim
From: Raymond Toy
Subject: Re: Faster array access in loops
Message-ID: <4nk8mnvq5e.fsf@rtp.ericsson.se>
>>>>> "Tim" == Tim Bradshaw <···@cley.com> writes:

    Tim> gcc (2.95) -O2, inner product bit:

    Tim>     .LL11:
    Tim> 	    ld	[%g3+%o0], %f3
    Tim> 	    addcc	%g2, -1, %g2
    Tim> 	    ld	[%g3+%o1], %f2
    Tim> 	    fmuls	%f3, %f2, %f3
    Tim> 	    add	%g3, 4, %g3
    Tim> 	    bne	.LL11
    Tim> 	    fadds	%f0, %f3, %f0

[allegro snipped]

    Tim> CMUCL:

    Tim>     5B4: L1:   add   %nl1, 1, %cfunc
    Tim>     5B8:       ldf   [%a0+%cfunc], %f1
    Tim>     5BC:       add   %nl1, 1, %cfunc
    Tim>     5C0:       ldf   [%a1+%cfunc], %f2
    Tim>     5C4:       fmuls %f1, %f1, %f2
    Tim>     5C8:       fadds %f0, %f0, %f1
    Tim>     5CC:       add   4, %nl1
    Tim>     5D0: L2:   cmp   %nl1, %a3
    Tim>     5D4:       blt   L1

    Tim> CMUCL seems to be significantly worse than the other two --
    Tim> that second add at the start can't be needed can it? gcc

The adds at 5B4 and 5BC are due to the way CMUCL accesses arrays.  The
compiler uses VOPs (virtual ops) to do things, and the add and ldf at
5B4 and 5BC are the VOP used to access an element of a single-float
array.

I'm not aware of any way to get CMUCL to produce better code than this,
short of hacking on the compiler and/or VOPs.  While the
assembler/compiler does try to schedule instructions by rearranging
them if possible, I don't think CMUCL does any other sort of
peephole optimization; I think I've seen unconditional jumps to jumps.

    Tim> seems to be doing slightly better loop stuff -- I *presume*
    Tim> the addcc is what the bne is checking.  gcc is also getting

Yes.  None of the other instructions affect the (integer) condition
codes, so bne uses the one that was set by addcc.

Also note that gcc made good use of the delay slot in the branch.
CMUCL can do this sometimes, but not always.  gcc also changed the
loop counter to count down instead of up.  This helps by removing the
cmp instruction.
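[The count-down transformation is easy to sketch in C (function names
here are illustrative):]

```c
#include <stddef.h>

/* Counting up: the loop needs an add (i++) plus a separate compare
   (i < n) to decide whether to branch. */
float dot_up(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Counting down toward zero: on a machine like SPARC the decrement
   itself can set the condition codes (the addcc %g2, -1, %g2 above),
   so the branch (bne) needs no separate cmp instruction. */
float dot_down(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    const float *ap = a, *bp = b;
    for (size_t i = n; i != 0; i--)   /* the test against zero is free */
        acc += *ap++ * *bp++;
    return acc;
}
```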

Ray
From: Duane Rettig
Subject: Re: Faster array access in loops
Message-ID: <43dtbunlc.fsf@beta.franz.com>
Tim Bradshaw <···@cley.com> writes:

> [I sent this in private mail, slightly by mistake, as it would be
> interesting if someone who understands sparc better than me (Duane?)
> could comment!]

Well, on both the Allegro and the CMUCL disassembler outputs
you left out the last instruction in the loop, the one in the
delay slot after the branch:

> Allegro:
> 
>       76: f9022002     ld	[%o0 + 2], %f28
>       80: 9006000a     add	%i0, %o2, %o0
>       84: f5022002     ld	[%o0 + 2], %f26
>       88: b9a7093a     fmuls	%f28, %f26, %f28
>       92: bda7883c     fadds	%f30, %f28, %f30
>       96: 9402a004     add	%o2, #x4, %o2
>     lb4:
>      100: 80a28009     cmp	%o2, %o1
>      104: 26bffff9     bl,a	76
=======108:  Missing instruction here.

When I compiled it, I got the following: the setup for the
loop is this:

  24: 10800008     ba	56
  28: 96100000     mov	%g0, %o3

followed by the loop itself:

  32: f902a002     ld	[%o2 + 2], %f28
  36: 9406400b     add	%i1, %o3, %o2
  40: f502a002     ld	[%o2 + 2], %f26
  44: b9a7093a     fmuls	%f28, %f26, %f28
  48: bda7883c     fadds	%f30, %f28, %f30
  52: 9602e004     add	%o3, #x4, %o3

  56: 80a2c00c     cmp	%o3, %o4
  60: 26bffff9     bl,a	32
  64: 9406000b     add	%i0, %o3, %o2

Note that there are extra adds for both array accesses because the
SPARC does not provide combined indexed-and-displaced loads and stores
(actually, few RISC machines do; the HP does, but the allowable
displacement is so small that an extra add is almost always done
anyway).  You can load from memory using a register and a displacement,
or a register and a register, but not all three.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Jeff Dalton
Subject: Re: Faster array access in loops
Message-ID: <x24sds2gzy.fsf@todday.aiai.ed.ac.uk>
Shiv <····@balrog.ece.ucsb.edu> writes:

> Recently I have been writing some number crunching code and have
> observed (as is well-known) that the Lisp code runs a factor of 2 to 3
> times slower than comparable Fortran/C code (at least on my Sparc20).
> For example:
> 
> (defun inner-product (a b)
>   (declare (type (simple-array single-float (*)) a b)
> 	   (optimize (speed 3) (safety 0) (space 0) (debug 0)))
>   (let ((n (array-dimension a 0))
> 	(acc 0.0))
>     (declare (fixnum n)
> 	     (single-float acc))
>     (dotimes (i n acc)
>        (declare (fixnum i))
>        (incf acc (* (aref a i) (aref b i))))))
> 
> will be within a factor of two of equivalent Fortran/C code.
> Disassembling the compiled code seems to indicate that the slow-down
> is mainly due to index-arithmetic for the array accesses.
> 
> It seems to me that this situation can be improved if the programmer
> is given access to pointers that point to a specific array element.
> (I followed the recent thread on locatives and this did not seem to be
> mentioned.)

There is no standard, efficient way to do this.  Indeed, I don't
know of any Common Lisp on "stock" hardware (i.e., not Lisp machines)
that has efficient locatives (or equivalent entities).

(I'd be glad to hear otherwise, of course.)

Some suggestions:

* In some implementations (Lucid/Liquid), it can make a big
  difference if you optimize (compilation-speed 0).

* In some implementations, dotimes cannot be relied on to do fixnum
  arithmetic for its manipulation of the loop variable even if you
  declare that the variable is a fixnum.  You have to write the addition
  and test out explicitly yourself, making sure to declare that the
  result of adding 1 to i is a fixnum, like this:

     (setq i (the fixnum (+ 1 i)))

  [I recommend writing a fix-dotimes, or dotimes-fixnum, or
  whatever name you prefer, macro to handle such things.]

* You may have to do something similar to

   (incf acc (* (aref a i) (aref b i)))

  to promise that the result will be a single-float.

* It's also possible that fixnum is not the most efficient type.
  (I think CMU CL prefers something like (unsigned-byte 32).)

  Or maybe double float arrays are handled more efficiently than
  single-float ones.

Anyway, in GCL, which is not a highly optimizing implementation,
AREF of (ARRAY SHORT-FLOAT) and FIXNUM with a known-to-be SHORT-FLOAT
result compiles to the C fragment "(#0)->sfa.sfa_self[#1]",
where #0 will be something that has the array as its value, and #1
will be something that has an int index as its value.  The C compiler
can optimize this further, of course, and so the result ought to be
fairly efficient.  Other implementations with good compilers ought to
be able to do at least as well.
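[A rough C picture of that access; the struct layout below is a
simplified stand-in for illustration, not GCL's actual object
definition:]

```c
/* Illustrative only: a simplified stand-in for a short-float array
   header.  The point is the shape of the generated access: one
   pointer dereference plus an indexed load, which the C compiler is
   then free to hoist and strength-reduce like any other C code. */
struct sfarray {
    float *sfa_self;   /* pointer to the element data */
    int    sfa_dim;    /* number of elements */
};

static inline float sfa_aref(const struct sfarray *a, int i)
{
    return a->sfa_self[i];   /* shape of (#0)->sfa.sfa_self[#1] */
}
```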

-- j
From: Duane Rettig
Subject: Re: Faster array access in loops
Message-ID: <44sdruom4.fsf@beta.franz.com>
Shiv <····@balrog.ece.ucsb.edu> writes:

> Recently I have been writing some number crunching code and have
> observed (as is well-known) that the Lisp code runs a factor of 2 to 3
> times slower than comparable Fortran/C code (at least on my Sparc20).
   [example removed]
> will be within a factor of two of equivalent Fortran/C code.
> Disassembling the compiled code seems to indicate that the slow-down
> is mainly due to index-arithmetic for the array accesses.

Yes.  This is due to the fact that "strength reduction" is not being
done in Lisps.  Strength reduction is the kind of loop optimization
where a compiler takes what it knows about one or more loop variables
and array accesses, and shifts the accesses around to reduce the actual
operations required within the loop(s).  This can be carried to amazing
levels, especially when vectorizing hardware is available that allows
strides to be specified.  In this case the arrays are one-dimensional,
and the stride is (fixnum) 1, so the optimization is relatively easy.
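[In C terms, strength reduction replaces the per-iteration
base + index*scale address computation with a running pointer; a
sketch, with illustrative function names:]

```c
/* Before strength reduction: each iteration computes the element
   address as base + i*sizeof(float) -- the index arithmetic visible
   in the disassembly of the Lisp loop. */
float sum_indexed(const float *a, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i];                 /* address = a + i*4 each time */
    return acc;
}

/* After strength reduction: the multiply-by-stride disappears; an
   induction pointer is just bumped by the stride (one element). */
float sum_reduced(const float *a, int n)
{
    float acc = 0.0f;
    const float *p = a, *end = a + n;
    while (p != end)
        acc += *p++;                 /* address advances by 4 per step */
    return acc;
}
```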

> It seems to me that this situation can be improved if the programmer
> is given access to pointers that point to a specific array element.

Using locatives becomes a problem on general-purpose hardware whenever a copying gc
might "see" into the stack and/or registers during the execution of
the loop.  It is usually hard for a simple locative (something that
would fit into a register) to be associated with the object that
it is pointing into, so that when the object is moved the locative
moves with it.  You can note that in the lisp code, the array pointers
always point to the "beginning" (modulo the tag) of the arrays, and since
fixnums are also tagged they are immune to gc.  Thus, these locatives
are only transiently active within atomic operations such as the actual
aref operation.

Now, in Allegro CL, we _do_ have one kind of locative: the program-counter
address.  It is usually used for start addresses or return addresses,
but this locative type's rules include the idea that whenever the gc
might see it, it is always in one or two specific registers that only
hold pc-addresses, or else on specific places in the stack.  This way,
if a code vector ever moves, all return addresses to within that code
vector will also move properly.

> My question is: Are such operations easy (for compiler writers) to
> provide (if at all)?  By 'easy', I mean: can existing compilers be
> easily extended to accommodate this?
> 
> --shiv--

It may be possible, if a gc can be guaranteed not to occur, to provide
strength reduction optimizations on array-intensive loops.  Again in
Allegro CL, we have added a macro called excl:atomically, which can be
wrapped around any form that you want to compile.  If that code compiles
without error, it is atomic with respect to the garbage collector, and
such strength reduction could conceivably be done.  In your example
function, the dotimes is such an atomic loop, when speed is 3 and
safety is 0.

> PS: If GC-issues are not a problem, it seems the difficulty-level
> should be on par with treating floats: that is, boxing and UN-boxing
> the array-locatives.

Well, the gc issues are really the problem, so it's not quite the
same.  Boxing allows you to "hide" immediate values from the gc
that should not change.  Locatives go the opposite way: they _want_
to change when the object under them is moved by the gc.  But if
you can _avoid_ the possibility of gc while these locatives (really
just pointers into the middle of the arrays) are active, then it is
possible.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Bernhard Pfahringer
Subject: Re: Faster array access in loops
Message-ID: <82qmsm$3drc$1@www.univie.ac.at>
In article <·············@beta.franz.com>,
Duane Rettig  <·····@franz.com> wrote:
>
>> PS: If GC-issues are not a problem, it seems the difficulty-level
>> should be on par with treating floats: that is, boxing and UN-boxing
>> the array-locatives.
>
>Well, the gc issues are really the problem, so it's not quite the
>same.  Boxing allows you to "hide" immediate values from the gc
>that should not change.  Locatives go the opposite way: they _want_
>to change when the object under them is moved by the gc.  But if
>you can _avoid_ the possibility of gc while these locatives (really
>just pointers into the middle of the arrays) are active, then it is
>possible.
>

So locatives into static arrays (an ACL extension) should be safe.
If there were a way to declare say (TYPE (simple-array FIXNUM (*) STATIC)),
then the compiler could be extra-smart.

Bernhard
-- 
--------------------------------------------------------------------------
Bernhard Pfahringer
Austrian Research Institute for  http://www.ai.univie.ac.at/~bernhard/
Artificial Intelligence          ········@ai.univie.ac.at 
From: Duane Rettig
Subject: Re: Faster array access in loops
Message-ID: <47limai5d.fsf@beta.franz.com>
········@hummel.ai.univie.ac.at (Bernhard Pfahringer) writes:

> In article <·············@beta.franz.com>,
> Duane Rettig  <·····@franz.com> wrote:
> >
> >> PS: If GC-issues are not a problem, it seems the difficulty-level
> >> should be on par with treating floats: that is, boxing and UN-boxing
> >> the array-locatives.
> >
> >Well, the gc issues are really the problem, so it's not quite the
> >same.  Boxing allows you to "hide" immediate values from the gc
> >that should not change.  Locatives go the opposite way: they _want_
> >to change when the object under them is moved by the gc.  But if
> >you can _avoid_ the possibility of gc while these locatives (really
> >just pointers into the middle of the arrays) are active, then it is
> >possible.
> >
> 
> So locatives into static arrays (an ACL extension) should be safe.

Yes.

And I failed to mention last posting, that while you are waiting for
us to actually make use of atomicity and immobility, with all of our
infinite time and resources :-), you can make use of the
comp::*hack-compiler-output* feature I've described a couple of times
previously on this ng to roll (or unroll :-) your own simple loop
rewrites.

> If there were a way to declare say (TYPE (simple-array FIXNUM (*) STATIC)),
> then the compiler could be extra-smart.

Yes, but I wouldn't munge the type system to do this; instead I would
prefer adding a separate STATIC declaration, such as

 (declare (type (simple-array fixnum (*)) a b c)
          (static a b))

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Duane Rettig
Subject: Re: Faster array access in loops
Message-ID: <4r9gue7ha.fsf@beta.franz.com>
Duane Rettig <·····@franz.com> writes:

> strength reduction optimizations on array-intensive loops.  Again in
> Allegro CL, we have added a macro called excl:atomically, which can be
> wrapped around any form that you want to compile.  If that code compiles

I should correct myself; the name of the macro is excl::atomically
(note that it is not exported).  It is thus not documented either, except
within special "inside Allegro CL" or advanced tutorials or courses.
The concept is simple enough, however, that one could make some
guesses as to how to use it.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)