From: Didier Verna
Subject: Re: How to make Lisp go faster than C
Date: 
Message-ID: <muxk57iiduv.fsf@uzeb.lrde.epita.fr>
Francogrex <······@grex.org> wrote:

> Read the Research and Development article by Didier Verna:
> http://www.lrde.epita.fr/~didier/research/verna.06.imecs.pdf

  FWIW, try s/imecs/ecoop/. It's the director's cut version. Also, note
that I'm going to present the sequel at ILC.

-- 
European Lisp Symposium, May 2009: http://www.european-lisp-symposium.org

Scientific site:   http://www.lrde.epita.fr/~didier
Music (Jazz) site: http://www.didierverna.com

From: Scott
Subject: Re: How to make C go faster than C
Date: 
Message-ID: <c39ce546-dc0c-4cb2-a7bd-90e447876ee5@c36g2000yqn.googlegroups.com>
I read the paper, but I don't think Lisp is going to catch up to C by
doing speed comparisons that don't put the latest C compilers into
high gear.  Here's an obvious extreme (because it's doing arithmetic
on unsigned bytes), but I think it's fair since you suggest replacing
C in image processing (which frequently uses unsigned bytes):

    static void add1 (uint8_t* to, uint8_t* from, uint8_t val, int nn)
    {
        int ii;
        for (ii = 0; ii < nn; ++ii) {
            to[ii] = from[ii] + val;
        }
    }

That readily becomes:

    static void add2 (uint8_t* to, uint8_t* from, uint8_t val, int nn)
    {
        // Just a few "declarations" :-)
        typedef uint8_t v16ub __attribute__ ((vector_size(16)));
        v16ub* t = (v16ub*)to;
        v16ub* f = (v16ub*)from;
        v16ub v = {
            val, val, val, val, val, val, val, val,
            val, val, val, val, val, val, val, val
        };

        int ii;
        for (ii = 0; ii < nn/16; ++ii) {
            t[ii] = f[ii] + v;
        }
    }


This C is about 8 times faster than the scalar version above (which is
intended to be the same as in your paper).  The other cases (floats
and out-of-order accesses) don't win as handily, but if you were
really doing image processing, you would happily take these gains and
probably accept the 16-byte alignment restriction.
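One caveat: add2 silently drops the trailing nn % 16 elements when nn
isn't a multiple of 16.  A sketch of one way to handle the remainder
(same GCC vector extension as above; the name add3 and the cleanup
loop are my addition, not from the paper):

```c
#include <stdint.h>

static void add3 (uint8_t* to, uint8_t* from, uint8_t val, int nn)
{
    typedef uint8_t v16ub __attribute__ ((vector_size(16)));
    v16ub* t = (v16ub*)to;
    v16ub* f = (v16ub*)from;
    v16ub v = {
        val, val, val, val, val, val, val, val,
        val, val, val, val, val, val, val, val
    };

    int ii;
    /* Vectorized main loop: 16 bytes per iteration. */
    for (ii = 0; ii < nn/16; ++ii) {
        t[ii] = f[ii] + v;
    }
    /* Scalar cleanup for the trailing nn % 16 elements. */
    for (ii = nn - nn%16; ii < nn; ++ii) {
        to[ii] = from[ii] + val;
    }
}
```

The 16-byte alignment assumption on to and from still applies to the
vector part of the loop.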

By the way, the link in your paper (footnote 3 at the bottom of page
2) seems to be incorrect.  I think it's missing the "s" in "publis":

    http://www.lrde.epita.fr/~didier/comp/research/publis.php

I also didn't find the source code on your site.

Cheers,
    -Scott
From: Jeff M.
Subject: Re: How to make C go faster than C
Date: 
Message-ID: <7079f865-b0aa-474d-96fa-4dd104640ac3@j1g2000yqi.googlegroups.com>
On Feb 22, 3:33 pm, Scott <·······@gmail.com> wrote:

Throw in a couple of __restrict qualifiers on from and to to rule out
pointer aliasing and the compiler will do MUCH better as well, getting
rid of LHS (load-hit-store) stalls.
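Concretely, the qualified version would look something like this (a
sketch; restrict is the C99 spelling, and GCC also accepts __restrict
outside C99 mode -- whether it actually helps depends on the compiler
and target):

```c
#include <stdint.h>

/* Same scalar loop, with restrict promising the compiler that the
   to and from arrays never overlap, so loads and stores can be
   reordered and the loop vectorized without runtime alias checks. */
static void add1r (uint8_t* restrict to, const uint8_t* restrict from,
                   uint8_t val, int nn)
{
    int ii;
    for (ii = 0; ii < nn; ++ii) {
        to[ii] = from[ii] + val;
    }
}
```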

Jeff M.
From: Scott
Subject: Re: How to make C go faster than C
Date: 
Message-ID: <ee0b0a01-fd88-452e-9108-3ecddf276ed5@d32g2000yqe.googlegroups.com>
On Feb 23, 12:22 pm, "Jeff M." <·······@gmail.com> wrote:
>
> Throw in a couple of __restrict qualifiers on from and to to rule out
> pointer aliasing and the compiler will do MUCH better as well, getting
> rid of LHS (load-hit-store) stalls.
>

I profiled and disassembled it with "restrict"s (with -std=c99) and
without.  I also threw -fno-inline in there so GCC wouldn't cheat too
much.  It didn't make much difference either way, so I decided to keep
it simpler.  I think GCC starts assuming this stuff when you use -O3
(which is what the author specified in the article).


Cheers,
    -Scott
From: Jeff M.
Subject: Re: How to make C go faster than C
Date: 
Message-ID: <5b784478-f468-44f0-bbcf-c75c3b875011@v38g2000yqb.googlegroups.com>
On Feb 23, 6:07 pm, Scott <·······@gmail.com> wrote:
> On Feb 23, 12:22 pm, "Jeff M." <·······@gmail.com> wrote:
> I profiled and disassembled it with "restrict"s (with -std=c99) and
> without.  I also threw -fno-inline in there so GCC wouldn't cheat too
> much.  It didn't make much difference either way, so I decided to keep
> it simpler.  I think GCC starts assuming this stuff when you use -O3
> (which is what the author specified in the article).

What processor? I imagine x86.  Doing the same thing on an in-order
execution machine will likely yield very different results.

However, your results are interesting to note.  If GCC really is
assuming __restrict at -O3, that could end up generating incorrect
code for some algorithms.

Jeff M.
From: George Neuner
Subject: Re: How to make Lisp go faster than C
Date: 
Message-ID: <5u73q4hsc716bu2k7cjk94va7tem9vm497@4ax.com>
On Sun, 22 Feb 2009 18:24:40 +0100, Didier Verna
<······@lrde.epita.fr> wrote:

>> http://www.lrde.epita.fr/~didier/research/verna.06.imecs.pdf

Was this article scanned or produced with a PDF writer?  I tried to
read it with Adobe 8 on a laptop and the text was illegible until
viewed at 300% magnification.  The characters didn't become solid
until 600% magnification.  Printed ok though.

And fortunately it's short.  I am currently trying to slog through a
900-page PDF that I can only view 1/3 of a page at a time.

George
From: Waldek Hebisch
Subject: Re: How to make Lisp go faster than C
Date: 
Message-ID: <gnt8qu$3kp$1@z-news.wcss.wroc.pl>
Didier Verna <······@lrde.epita.fr> wrote:
> Francogrex <······@grex.org> wrote:
> 
> > Read the Research and Development article by Didier Verna:
> > http://www.lrde.epita.fr/~didier/research/verna.06.imecs.pdf
> 
>   FWIW, try s/imecs/ecoop/. It's the director's cut version. Also, note
> that I'm going to present the sequel at ILC.
> 

I appreciate the information about optimizing Lisp code contained
in this article.  However, I consider the claim "Beating C in
Scientific Computing Applications" to be misleading.  My personal
experience using both SBCL and C (gcc) is that SBCL generates
significantly worse code than gcc:

1) SBCL in many cases generates shifts/masks due to tags in Lisp
   integers.  Declaring integers as (unsigned-byte 64) (or 32 on
   32-bit machines) helps in some cases, but others remain.
2) SBCL frequently generates "useless" instructions: extra copies,
   recomputation of subexpressions which did not change, failure to
   exploit addressing modes, etc.  In other words, SBCL's code
   generator is much less sophisticated than gcc's.

I would say that in compute-bound code it is hard to get within a
factor of 2 of gcc using SBCL.  Real code is frequently memory-bound,
and this tends to significantly narrow the performance gap.  However,
in many important cases it is possible to design algorithms so that
the data is available in the L1 cache -- such algorithms get the full
benefit of better code generation.  And on code which does many
memory accesses, gcc will still have the advantage of a smaller
number of non-memory operations.

Extra remark about the paper:

- The C code for add in Listing 1 is pretty inefficient.  One thing
  is the mixing of integer and floating-point operations.  Another,
  probably more important, is the use of double indirection, which
  forces 4 memory accesses in the inner loop.
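To make the double-indirection point concrete (my own sketch, not the
paper's actual Listing 1): with a row-pointer image, each element
access goes through two loads -- the row pointer, then the element --
and with stores in the loop the compiler may not manage to keep the
row pointers in registers, while a flat layout needs just one load
and one store per element:

```c
#include <stddef.h>

/* Row-pointer layout: each access loads to[i] or from[i] first,
   then the element behind it -- the extra memory traffic in the
   inner loop. */
static void add_2d (float** to, float** from, float val,
                    size_t h, size_t w)
{
    for (size_t i = 0; i < h; ++i)
        for (size_t j = 0; j < w; ++j)
            to[i][j] = from[i][j] + val;
}

/* Flat layout: one base pointer per image, one load and one store
   per element. */
static void add_flat (float* to, const float* from, float val,
                      size_t n)
{
    for (size_t i = 0; i < n; ++i)
        to[i] = from[i] + val;
}
```

With both images flat (and, ideally, restrict-qualified pointers),
the inner loop is down to one load and one store per element.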

So kudos for the serious work on improving Lisp performance.  And
kudos to the CMU CL and SBCL developers for good compilers.  But
guys, get real: there is a long way to go before Lisp technology can
match C.

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl 
From: Dimiter "malkia" Stanev
Subject: Re: How to make Lisp go faster than C
Date: 
Message-ID: <gnuqql$g9e$1@malkia.motzarella.org>
Ditto what Waldek is saying about SBCL (and other Lisp compilers).

For example, one thing that bothers me is that, depending on the
memory address at which a function was compiled, it can run slower or
faster.  Simple explanation: any loop whose branch target (jmp, jne,
etc.) does not fall on a 16-byte-aligned boundary is going to be
slower.  So depending on where the function was compiled (about 1
time in 4), it would get significantly faster.  (I wonder how they
get consistent boinkor results every time.)

But then again, it's one of the best compilers I've seen for such a
dynamic language.
