From: R. Matthew Emerson
Subject: toy floating point performance test
Date: 
Message-ID: <87emhjpmvm.fsf@nightfly.apk.net>
I wrote a Lisp function implementing the forward discrete cosine
transform (as used in JPEG), and compared it to the C version from the
well-known IJG JPEG library.

If we say that the C version takes time 1.0, then CMUCL takes time
2.4, and ACL 5.0.1 takes time 3.86.  This is on my 400MHz Pentium II
running FreeBSD.  I don't know if this proves anything significant, but
there it is.

The code is at http://www.thoughtstuff.com/rme/lisp.html.

-matt

From: Barry Margolin
Subject: Re: toy floating point performance test
Date: 
Message-ID: <2CZp3.46$ej1.7483@burlma1-snr2>
In article <··············@nightfly.apk.net>,
R. Matthew Emerson <···@nightfly.apk.net> wrote:
>I wrote a Lisp function implementing the forward discrete cosine
>transform (as used in JPEG), and compared it to the C version from the
>well-known IJG JPEG library.
>
>If we say that the C version takes time 1.0, then CMUCL takes time
>2.4, and ACL 5.0.1 takes time 3.86.  This is on my 400MHz Pentium II
>running FreeBSD.  I don't know if this proves anything significant, but
>there it is.

Try adding declarations for all your temporary variables that only hold
floats, e.g.

      (let ((q0 (+ p0 p3))
            (q1 (+ p1 p2))
            (q2 (- p1 p2))
            (q3 (- p0 p3)))
        (declare (type single-float q0 q1 q2 q3))

CMUCL is pretty good at doing data flow analysis to determine some of these
declarations automatically; I think there's a way to get its compiler to
tell you what declaration propagation it has done and couldn't do.
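Fleshed out, the kind of thing I mean looks roughly like this (an
untested sketch; the 8x8 SINGLE-FLOAT array type and the P0..P3
bindings are assumptions, not taken from the posted code):

      (defun fdct-row-pass (b)
        (declare (type (simple-array single-float (8 8)) b)
                 (optimize (speed 3) (safety 0)))
        ;; P0..P3 are four samples from the first row of the block.
        (let ((p0 (aref b 0 0))
              (p1 (aref b 0 1))
              (p2 (aref b 0 2))
              (p3 (aref b 0 3)))
          (declare (type single-float p0 p1 p2 p3))
          (let ((q0 (+ p0 p3))
                (q1 (+ p1 p2))
                (q2 (- p1 p2))
                (q3 (- p0 p3)))
            (declare (type single-float q0 q1 q2 q3))
            ;; The rest of the butterfly arithmetic would go here.
            (values q0 q1 q2 q3))))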

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Raymond Toy
Subject: Re: toy floating point performance test
Date: 
Message-ID: <4niu6vz32i.fsf@rtp.ericsson.se>
>>>>> "Barry" == Barry Margolin <······@bbnplanet.com> writes:

    Barry> In article <··············@nightfly.apk.net>,
    Barry> R. Matthew Emerson <···@nightfly.apk.net> wrote:
    >> I wrote a Lisp function implementing the forward discrete cosine
    >> transform (as used in JPEG), and compared it to the C version from the
    >> well-known IJG JPEG library.
    >> 
    >> If we say that the C version takes time 1.0, then CMUCL takes time
    >> 2.4, and ACL 5.0.1 takes time 3.86.  This is on my 400MHz Pentium II
    >> running FreeBSD.  I don't know if this proves anything significant, but
    >> there it is.

    Barry> Try adding declarations for all your temporary variables that only hold
    Barry> floats, e.g.

    Barry>       (let ((q0 (+ p0 p3))
    Barry>             (q1 (+ p1 p2))
    Barry>             (q2 (- p1 p2))
    Barry>             (q3 (- p0 p3)))
    Barry>         (declare (type single-float q0 q1 q2 q3))

Actually, no other declarations are needed in CMUCL.  From the type of 
B, the compiler can derive the types of all other variables in the
code.
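For example, something like this is enough (illustrative sketch; the
function is just a stand-in, and the 8x8 SINGLE-FLOAT element type is
an assumption about the posted code):

    (defun demo-butterfly! (b)
      (declare (type (simple-array single-float (8 8)) b)
               (optimize (speed 3) (safety 0)))
      ;; With B's type declared, CMUCL infers that each AREF returns a
      ;; SINGLE-FLOAT, so the sums and differences below are known to
      ;; be SINGLE-FLOATs without any further declarations.
      (dotimes (i 8 b)
        (let ((p0 (aref b i 0))
              (p3 (aref b i 3)))
          (setf (aref b i 0) (+ p0 p3)
                (aref b i 3) (- p0 p3)))))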

As for why CMUCL is so much slower than C, I can't say.  Perhaps the
2-D array access is not as good as it could be?  On a 300 MHz
UltraSPARC, CMUCL takes 1.12 sec for the test, while gcc gives 0.61 sec
(a ratio of 1.8).

    Barry> CMUCL is pretty good at doing data flow analysis to
    Barry> determine some of these declarations automatically; I think
    Barry> there's a way to get its compiler to tell you what
    Barry> declaration propagation it has done and couldn't do.

I'm not aware of any such way, but the compiler notes produced from
compiling the code will have lots of hints on what the compiler thinks
are the types of things.  It's pretty easy after that to figure out
what declarations to put in.
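For example, compiling something like this with SPEED 3 makes the
compiler print efficiency notes for every AREF and float operation it
cannot open-code, and the notes mention the types it inferred (sketch
only; the file and function names are made up):

    ;; fdct.lisp
    (declaim (optimize (speed 3) (safety 1)))

    (defun add-corners (b)
      ;; No declaration on B, so CMUCL must use generic AREF and
      ;; generic +, and it says so in efficiency notes when this file
      ;; is compiled.
      (+ (aref b 0 0) (aref b 7 7)))

    ;; At the REPL: (compile-file "fdct.lisp")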

Ray
From: Chuck Fry
Subject: Re: toy floating point performance test
Date: 
Message-ID: <7o9vf6$lcu$1@shell5.ba.best.com>
In article <··············@nightfly.apk.net>,
R. Matthew Emerson <···@nightfly.apk.net> wrote:
>I wrote a Lisp function implementing the forward discrete cosine
>transform (as used in JPEG), and compared it to the C version from the
>well-known IJG JPEG library.
>
>If we say that the C version takes time 1.0, then CMUCL takes time
>2.4, and ACL 5.0.1 takes time 3.86.  This is on my 400MHz Pentium II
>running FreeBSD.  I don't know if this proves anything significant, but
>there it is.
>
>The code is at http://www.thoughtstuff.com/rme/lisp.html.
>
>-matt

I took a brief look at your function FORWARD-DCT! and tried to tune it a
bit for ACL 5.0 on my Sparc.

First, I looked at the compiler's type information using 
 (declare (:explain :types))
which is an Allegro-specific extension that causes type information as
observed by the compiler to be displayed during compilation.
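For reference, the declaration goes inside the function being
compiled, e.g. (sketch; the function here is just a stand-in for
FORWARD-DCT!):

    (defun explain-demo (b)
      (declare (:explain :types)  ; Allegro-specific: report inferred types
               (type (simple-array single-float (8 8)) b)
               (optimize (speed 3) (safety 0)))
      (+ (aref b 0 0) (aref b 7 7)))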

Everything was being recognized as the appropriate type, so there wasn't
anything to gain here.

Looking at the disassembled code, I noticed a couple of things.  First,
the good news: all the arithmetic was being coded in-line, and the
floats weren't being boxed.  There's no way I know of to further improve
the speed of the arithmetic.

Now the bad news: each array reference required three memory references.
This is due to ACL's choice of representation for 2D arrays.  The C
version should only require a single memory reference per array
reference.  This is the cause of the bulk of the performance difference.

I tried to reimplement the array accesses using ROW-MAJOR-AREF and
ARRAY-ROW-MAJOR-INDEX.  This was a pessimization, as these functions
aren't inlined, even at safety=0 and speed=3.  Even if they had been,
it's not clear this would reduce the number of memory references
required for each array reference.  Likewise, using a displaced array
would result in the same three memory references per array reference.
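For concreteness, the rewrite was along these lines (a sketch of the
idea, not the actual code I used):

    (defun row-sum (b i)
      (declare (type (simple-array single-float (8 8)) b)
               (type (integer 0 7) i)
               (optimize (speed 3) (safety 0)))
      ;; Compute the flat index once, then step through the row with
      ;; ROW-MAJOR-AREF instead of repeated two-subscript AREFs.
      (let ((idx (array-row-major-index b i 0))
            (sum 0.0f0))
        (declare (type single-float sum))
        (dotimes (j 8 sum)
          (incf sum (row-major-aref b (+ idx j))))))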

Then I took a close look at your code.  As coded, the function
FORWARD-DCT! does twice as many array reads as necessary.  I found that
if I cached the contents of each row or column in local variables, and
then operated on those cached values instead of reading directly from
the array, I could eliminate half of the array reads.  This yielded a
20% reduction in run time overall.
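The caching pattern looks roughly like this (sketch under the same
8x8 SINGLE-FLOAT assumption; names are invented):

    (defun cached-row-pass! (b i)
      (declare (type (simple-array single-float (8 8)) b)
               (type (integer 0 7) i)
               (optimize (speed 3) (safety 0)))
      ;; Read the row once into locals, do all the arithmetic on the
      ;; locals, then write the results back: one read and one write
      ;; per element instead of re-reading the array at each step.
      (let ((d0 (aref b i 0)) (d1 (aref b i 1))
            (d2 (aref b i 2)) (d3 (aref b i 3))
            (d4 (aref b i 4)) (d5 (aref b i 5))
            (d6 (aref b i 6)) (d7 (aref b i 7)))
        (declare (type single-float d0 d1 d2 d3 d4 d5 d6 d7))
        (setf (aref b i 0) (+ d0 d7)  (aref b i 7) (- d0 d7)
              (aref b i 1) (+ d1 d6)  (aref b i 6) (- d1 d6)
              (aref b i 2) (+ d2 d5)  (aref b i 5) (- d2 d5)
              (aref b i 3) (+ d3 d4)  (aref b i 4) (- d3 d4))
        b))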

The Sparc's large number of floating point registers made this approach
possible.  It should work on any processor with a large (n >= 16) set of
FP registers.  I don't know if the same approach will work on
Pentium-class processors.  It may help to reorganize the code so that
the array references are performed only when their values are needed.

Further performance improvement in ACL will require Franz's help,
specifically a non-consing way to access the storage vector of a 2D
array directly.  Duane?

 -- Chuck
-- 
	    Chuck Fry -- Jack of all trades, master of none
 ······@chucko.com (text only please)  ········@home.com (MIME enabled)
Lisp bigot, mountain biker, car nut, sometime guitarist and photographer
The addresses above are real.  All spammers will be reported to their ISPs.
From: Raymond Toy
Subject: Re: toy floating point performance test
Date: 
Message-ID: <4ng11zz0l8.fsf@rtp.ericsson.se>
>>>>> "Chuck" == Chuck Fry <······@best.com> writes:

    Chuck> the good news: all the arithmetic was being coded in-line,
    Chuck> and the floats weren't being boxed.  There's no way I know

Likewise for CMUCL on Sparc and x86.

    Chuck> I tried to reimplement the array accesses using
    Chuck> ROW-MAJOR-AREF and ARRAY-ROW-MAJOR-INDEX.  This was a
    Chuck> pessimization, as these functions aren't inlined, even at

I think CMUCL converts all array references to row-major-aref anyway,
so this would not change anything.  row-major-aref is inlined too, I
think.

I took this idea one step further:  make the array a 1-d array.  This
was a major win.  Time went from 1.21 sec to 0.77 sec.
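That is, store the 8x8 block as a 64-element vector and do the index
arithmetic by hand; roughly (sketch, names invented):

    (defun butterfly-rows-1d! (b)
      (declare (type (simple-array single-float (64)) b)
               (optimize (speed 3) (safety 0)))
      ;; Element (i,j) of the 8x8 block lives at (+ (* 8 i) j), so each
      ;; access is a single indexed load on a simple vector instead of
      ;; a 2-D AREF.
      (dotimes (i 8 b)
        (let* ((base (* 8 i))
               (p0 (aref b base))
               (p7 (aref b (+ base 7))))
          (declare (type single-float p0 p7))
          (setf (aref b base)       (+ p0 p7)
                (aref b (+ base 7)) (- p0 p7)))))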

    Chuck> necessary.  I found that if I cached the contents of each
    Chuck> row or column in local variables, and then operated on
    Chuck> those cached values instead of reading directly from the
    Chuck> array, I could eliminate half of the array reads.  This
    Chuck> yielded a 20% reduction in run time overall.

Doing the same on CMUCL reduced the time to 0.71 sec.  To compare, gcc
2.7.2 -O6 produced 0.56 sec, Sun CC 4.0 -xO5 produced 0.32 sec, and
egcs-1.1b -O6 produced 0.35 sec.  So CMUCL takes 1.26 times as long as
gcc, 2.2 times as long as Sun CC, and 2.0 times as long as egcs.  Not
bad, but not great either.

On Linux, CMUCL produces 0.69 sec, and gcc 2.7.2 -O6 produces 0.42 sec,
a ratio of 1.6.

Note: timing test-dct includes the time necessary for level shifting
the image.  I don't know how significant that is compared to the DCT
itself.

Ray
From: Marco Antoniotti
Subject: Re: toy floating point performance test
Date: 
Message-ID: <lwoggm4l6i.fsf@copernico.parades.rm.cnr.it>
···@nightfly.apk.net (R. Matthew Emerson) writes:

> I wrote a Lisp function implementing the forward discrete cosine
> transform (as used in JPEG), and compared it to the C version from the
> well-known IJG JPEG library.
> 
> If we say that the C version takes time 1.0, then CMUCL takes time
> 2.4, and ACL 5.0.1 takes time 3.86.  This is on my 400MHz Pentium II
> running FreeBSD.  I don't know if this proves anything significant, but
> there it is.
> 
> The code is at http://www.thoughtstuff.com/rme/lisp.html.
> 
> -matt
> 

I ran a very simple test on the code: I rewrote the CL code into
'regular' C, but without the pointer "optimizations" used in the
original C code.

Here are the results on a

SunOS copernico 5.5.1 Generic_103640-18 sun4u sparc SUNW,Ultra-2

using gcc 2.7.2.2 and CMUCL 18b.

In CMUCL

==============================================================================
* (time (test-dct *image*))
Evaluation took:
  1.57 seconds of real time
  1.57 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  352 bytes consed.
NIL
==============================================================================

while on the shell running the modified C program (compiled with 'gcc
-O3') I get

==============================================================================
·······@copernico:cl 53> gcc -O3 test-dct-2d.c
·······@copernico:cl 54> time a.out
1.11u 0.01s 0:01.17 95.7%
==============================================================================

This data point would place CMUCL at ~1.41 times the GCC time.  Not
bad, but not very good either.

Cheers


-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa