From: ···········@mail.ru
Subject: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108475752.660004.23870@c13g2000cwb.googlegroups.com>
Hi,

Since I found Common Lisp I have been learning it from time to time. However,
I have to use C/C++ on a daily basis. I've started to promote Common Lisp and
didn't encounter any resistance from my boss. You can imagine what
difficulties I met using Common Lisp for my first application (they were
not about Lisp itself but more likely about my C/C++ habits).

Today I know much more than I did years ago. This year my co-workers started
to escape from C++ to C# and are pretty happy with it. Since the .NET runtime
is no smaller than any Lisp image, I no longer think much about my
applications' size. But the rest is SPEED.

The question is... I saw CMUCL and found that its performance on image
processing can be equal to optimized C code. But there is no such
thing on Windows. Why? Allegro CL, LispWorks, Corman Lisp - they are all too
slow in comparison with optimized C/C++ code. I've heard some arguments
about this (for example, messages from Duane Rettig about code
optimization on modern CPUs), but I still wonder why the commercial
vendors (even Franz) cannot improve their code generation to match the
competition. I would rewrite my C/C++ DLLs in Common Lisp and wouldn't
use the FFI so much. I believe it is possible (some day). 

Regards
Lisper

From: Marc Battyani
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <cut0l3$phf@library2.airnews.net>
<···········@mail.ru> wrote
> thing on Windows. Why? Allegro CL, LispWorks, Corman Lisp - they are all too
> slow in comparison with optimized C/C++ code. I've heard some arguments
> about this (for example, messages from Duane Rettig about code
> optimization on modern CPUs), but I still wonder why the commercial
> vendors (even Franz) cannot improve their code generation to match the
> competition.

Maybe because they already generate very good code ?
This is a recurring topic here, so search with Google and you will find lots
of examples.

Here is one from a recent discussion on the LispWorks mailing list:

(defun multiply-array-by-scalar4 (array scalar)
  (declare (type (simple-array single-float (*)) array)
           (type single-float scalar)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (loop for i fixnum below (the fixnum (length array)) do
       (setf (aref array i) (* scalar (aref array i)))))

LWW:
       0:      55               push  ebp
       1:      89E5             move  ebp, esp
       3:      83EC18           sub   esp, 18
       6:      C7042445180000   move  [esp], 1845
      13:      8B5D08           move  ebx, [ebp+8]
      16:      DD4004           fldl  [eax+4]
      19:      DD5DEC           fstpl [ebp-14]
      22:      8B430C           move  eax, [ebx+C]
      25:      33FF             xor   edi, edi
      27:      3BF8             cmp   edi, eax
      29:      7C0A             jl    L2
L1:   31:      B810000000       move  eax, 10
      36:      FD               std
      37:      C9               leave
      38:      C20400           ret   4
L2:   41:      89FA             move  edx, edi
      43:      C1FA06           sar   edx, 6
      46:      D9441310         flds  [ebx+10+edx]
      50:      DC4DEC           fmull [ebp-14]
      53:      89FA             move  edx, edi
      55:      C1FA06           sar   edx, 6
      58:      D95C1310         fstps [ebx+10+edx]
      62:      81C700010000     add   edi, 100
      68:      3BF8             cmp   edi, eax
      70:      7DD7             jge   L1
      72:      EBDF             jmp   L2

ACL 7.0:
   0: d9 42 f2      fldf [edx-14]
   3: dd da         fstp st(2)
   5: 8b 58 f2      movl     ebx,[eax-14]
   8: 33 d2         xorl    edx,edx
  10: eb 1b         jmp     39
  12: d9 44 10 f6   fldf [eax+edx-10]
  16: dd db         fstp st(3)
  18: d9 af 4b fd   fldcwf [edi-693]   ; SYS::SINGLE_CONVERTER
      ff ff
  24: d9 c2         fld st,st(2)
  26: d8 ca         fmul st,st(2)
  28: dd db         fstp st(3)
  30: d9 c2         fld st,st(2)
  32: d9 5c 10 f6   fstpf [eax+edx-10]
  36: 83 c2 04      addl     edx,$4
  39: 3b d3         cmpl    edx,ebx
  41: 7c e1         jl      12
  43: 8b c7         movl    eax,edi
  45: f8            clc
  46: 8b 75 fc      movl     esi,[ebp-4]
  49: c3            ret

That does not look very slow.
OK, those declarations are not pretty, but you only need them in some parts.
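
For example, only the speed-critical routine needs to carry them; the callers
can stay plain and safe. A quick sketch (IMAGE-PIXELS here is a made-up
accessor, not from any library):

(defun brighten-image (image factor)
  ;; no type or optimize declarations needed at this level
  (multiply-array-by-scalar4 (image-pixels image) (float factor 1.0)))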

> I would rewrite my C/C++ DLLs in Common Lisp and wouldn't
> use the FFI so much. I believe it is possible (some day).

C++ is slow anyway. If you really want speed then program in VHDL or
Verilog.

Marc
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108495713.519880.183380@o13g2000cwo.googlegroups.com>
Marc Battyani wrote:
> <···········@mail.ru> wrote
> > thing on Windows. Why? Allegro CL, LispWorks, Corman Lisp - they are all
> > too slow in comparison with optimized C/C++ code. I've heard some
> > arguments about this (for example, messages from Duane Rettig about code
> > optimization on modern CPUs), but I still wonder why the commercial
> > vendors (even Franz) cannot improve their code generation to match the
> > competition.
>
> Maybe because they already generate very good code ?
> This is a recurring topic here, so search with Google and you will find
> lots of examples.

May be "very good" is not enough ? OK. Let's evaluate performance:

(setf *array* (make-array 442368 :element-type 'single-float
:initial-element 2.0))

(defun sum-array (array)
  (declare (type (simple-array single-float (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((sum 0.0)) (declare (type single-float sum))
     (loop for i fixnum below (the fixnum (length array)) do
       (incf sum (the single-float (aref array i))))
  sum))

(time (sum-array *array*))

**********************************************
Timing the evaluation of (SUM-ARRAY *ARRAY*)

user time    =      0.015
system time  =      0.000
Elapsed time =   0:00:00
Allocation   = 16 bytes standard / 0 bytes conses
0 Page faults

884736.0
**********************************************
Now C/C++:

#include <windows.h>  /* QueryPerformanceCounter / QueryPerformanceFrequency */
#include <stdio.h>

float sum_elts(float *ar,int length)
{
    int i;
    float sum = 0.0;
    for (i=0;i<length;i++)
    {
        sum += ar[i];
    }
    return sum;
}

int main()
{
    int i;
    LARGE_INTEGER freq,st,en;
    float *ar = new float[442368];
    int length = 442368;
    for (i=0;i<length;i++)
    {
        ar[i]=2.0;
    }
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    float sum = sum_elts(ar,length);
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) /
freq.QuadPart;
    printf("%2.5lf seconds elapsed result: %lf\n",time,sum);

    delete [] ar;
    return 0;
}

0.00096 seconds elapsed result: 884736.000000

So the C++ code is about 15 times faster than the LispWorks version.

Now CMUCL:

* (time (sum-array *array*))
; [GC threshold exceeded with 30,693,280 bytes in use.  Commencing GC.]
; [GC completed with 19,894,104 bytes retained and 10,799,176 bytes
freed.]
; [GC will next occur when at least 31,894,104 bytes are in use.]
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   0.01 seconds of real time
;   0.002 seconds of user run time
;   0.0 seconds of system run time
;   5,719,264 CPU cycles
;   0 page faults and
;   8 bytes consed.
;
884736.0

The CMUCL code is as fast as the C++ version. So, what am I doing wrong?

I had already tested the same code on Allegro CL 7.0 and the results were no
better than on LispWorks (LispWorks was even slightly faster than ACL).


> That does not look very slow.
> OK, those declarations are not pretty, but you only need them in some parts.

I know, and that is OK.

> C++ is slow anyway. If you really want speed then program in VHDL or
> Verilog.

I want to program solely in Common Lisp, and I see that CMUCL can be as
fast as C++ or even faster. But unfortunately I need to program on MS
Windows ;-(

Regards
Lisper
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108496162.419536.262520@g14g2000cwa.googlegroups.com>
; Evaluation took:
;   0.01 seconds of real time
;   0.002 seconds of user run time

I was wrong; in this case CMUCL was 10 times slower than the C++ version.
;-(
But as I checked before, it shows performance equal to C/C++ on integer
arithmetic.

I'll check it again....

Regards
Lisper
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108498974.162307.53670@g14g2000cwa.googlegroups.com>
···········@mail.ru wrote:
> ; Evaluation took:
> ;   0.01 seconds of real time
> ;   0.002 seconds of user run time
>
> I was wrong; in this case CMUCL was 10 times slower than the C++ version.
> ;-(
> But as I checked before, it shows performance equal to C/C++ on integer
> arithmetic.
>
> I'll check it again....
>
> Regards
> Lisper

I was wrongly looking at the first number in the CMUCL timing. Looks like
CMUCL was only 2 times slower on single-float summing.


* (set-array)

NIL
* (time (sum-array *array*))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   0.00 seconds of real time
;   0.002 seconds of user run time
;   0.0 seconds of system run time
;   5,448,704 CPU cycles
;   0 page faults and
;   8 bytes consed.
;
884736.0

Let's compare performance on integers:

(defvar *array* (make-array 442368 :element-type 'fixnum
:initial-element 2))

(defun sum-array (array)
  (declare (type (simple-array fixnum (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((isum 0)) (declare (type fixnum isum))
    (loop for k fixnum below 10 do
     (progn
       (setf isum 0)
       (loop for i fixnum below (the fixnum (length array)) do
             (incf isum (the fixnum (aref array i))))))
  isum))

We will sum elements of *array* ten times.

LispWorks:
Timing the evaluation of (SUM-ARRAY *ARRAY*)

user time    =      0.015
system time  =      0.000
Elapsed time =   0:00:00
Allocation   = 0 bytes standard / 0 bytes conses
0 Page faults

884736

CMUCL:
* (time (sum-array *array*))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:

; Evaluation took:
;   0.01 seconds of real time
;   0.007999 seconds of user run time
;   0.0 seconds of system run time
;   20,653,528 CPU cycles
;   0 page faults and
;   0 bytes consed.
;
884736

C++ version:

#include <windows.h>
#include <stdio.h>

int sum_elts(int *ar,int length)
{
    int i,k;
    int sum = 0;
    for (k=0;k<10;k++)
    {
       sum = 0;
       for (i=0;i<length;i++)
       {
           sum += ar[i];
       }
    }
    return sum;
}

int main()
{
    int i;
    LARGE_INTEGER freq,st,en;
    int *ar = new int[442368];
    int length = 442368;
    for (i=0;i<length;i++)
    {
        ar[i]=2;
    }
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    int res = sum_elts(ar,length);
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) /
freq.QuadPart;
    printf("%2.5lf sec. elapsed res: %d\n",time,res);

    delete [] ar;
    return 0;
}

0.00621 sec. elapsed res: 884736

CMUCL was 1.28 times slower than C++ (the difference is so small it could be
timing noise).

LispWorks was 2.42 times slower than C++ (that is definitely not noise).

I was advised to loop this test 100 times or more to exclude timing noise;
I'll post those results later.

Regards
Lisper
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <ull9py30e.fsf@agharta.de>
On 15 Feb 2005 12:22:54 -0800, ···········@mail.ru wrote:

> LispWorks:
> Timing the evaluation of (SUM-ARRAY *ARRAY*)
>
> user time    =      0.015
> system time  =      0.000
> Elapsed time =   0:00:00
> Allocation   = 0 bytes standard / 0 bytes conses
> 0 Page faults

Try adding the declaration

  (hcl:fixnum-safety 0)

to your code.
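
For your integer version that would look something like this (an untested
sketch; HCL:FIXNUM-SAFETY is a LispWorks-specific optimize quality):

(defun sum-array (array)
  (declare (type (simple-array fixnum (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)
                     (hcl:fixnum-safety 0)))  ; no fixnum overflow/type checks
  (let ((isum 0))
    (declare (type fixnum isum))
    (loop for i fixnum below (the fixnum (length array)) do
          (incf isum (the fixnum (aref array i))))
    isum))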

You might also want to have a look at

  <http://www.lispworks.com/documentation/lw44/LWUG/html/lwuser-92.htm>

and in fact the whole chapter 9 of the LispWorks User Guide.

Cheers,
Edi.
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108504304.367738.311200@z14g2000cwz.googlegroups.com>
Thank you! This is exactly what I needed to read about LispWorks code
optimization.
From: Brian Downing
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <BlsQd.5773$4q6.1827@attbi_s01>
In article <························@g14g2000cwa.googlegroups.com>,
 <···········@mail.ru> wrote:
> ; Evaluation took:
> ;   0.01 seconds of real time
> ;   0.002 seconds of user run time
> 
> I was wrong; in this case CMUCL was 10 times slower than the C++ version.
> ;-(
> But as I checked before, it shows performance equal to C/C++ on integer
> arithmetic.
> 
> I'll check it again....

Your times are so small that they're way down in the noise for most
operating systems' timers.  Try looping your benchmark a couple thousand
times and compare that instead!
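
For example, something like this makes the total run long enough to measure
reliably:

  (time (loop repeat 1000 do (sum-array *array*)))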

(Incidentally I just did your example (x1000) here on CMUCL 18e and g++
3.1 (-O3), and both perform about the same.)

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108503707.325838.113710@z14g2000cwz.googlegroups.com>
Brian Downing wrote:
> In article <························@g14g2000cwa.googlegroups.com>,
>  <···········@mail.ru> wrote:
> > ; Evaluation took:
> > ;   0.01 seconds of real time
> > ;   0.002 seconds of user run time
> >
> > I was wrong; in this case CMUCL was 10 times slower than the C++ version.
> > ;-(
> > But as I checked before, it shows performance equal to C/C++ on integer
> > arithmetic.
> >
> > I'll check it again....
>
> Your times are so small that they're way down in the noise for most
> operating systems' timers.  Try looping your benchmark a couple thousand
> times and compare that instead!
>
> (Incidentally I just did your example (x1000) here on CMUCL 18e and g++
> 3.1 (-O3), and both perform about the same.)
>
> -bcd
> --
> *** Brian Downing <bdowning at lavos dot net>

You are right. I did (x10000) on integer summing and found that LispWorks
was only 1.3 times slower than the same C++ version. On summing the
corresponding elements of two arrays and writing the result into the second
array, LispWorks shows a time equal to CMUCL and C++, and CMUCL shows an even
slightly shorter time than C++ on both tasks.

But on single-float summing (x10000 as well) LispWorks is still at least ~5.3
times slower than C++ ;-(
and CMUCL is still 2 times slower (which I think is good).

I'd say LispWorks turns out to be fast enough to rewrite some of our image
processing C++ functions in it. We use a lot of integer arithmetic and much
less floating point.

Sorry to bother you with this topic. I had shown the same test to the Xanalys
team (that code was 1.9 times slower than the optimized C version) and they
couldn't make it faster. Our tasks don't allow us to be more than 1.3 times
slower than optimized C, otherwise we will not fit into the 40 ms interval
between video frames.

Thanks to all
Lisper
From: Marc Battyani
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <cutr4t$ac2@library2.airnews.net>
<···········@mail.ru> wrote :
> Marc Battyani wrote:
> > Maybe because they already generate very good code ?

[C++ code snipped]

> May be "very good" is not enough ? OK. Let's evaluate performance:
> 0.00096 seconds elapsed result: 884736.000000
>
> So C++ code is about 15 times faster than LispWorks version.

That's not what I get.
(As LWW works better with double-floats, I switched to double floats for both
C++ and Lisp.)

As we are discussing Windows, I tried your C++ version with MSVC 7.0 (cl /Ox
test.cpp):
0.00334 seconds elapsed result: 884736.000000

Then I tried LWW 4.4 (with 1000 iterations, as the LWW time function is not
very precise):

(setf *array* (make-array 442368 :element-type 'double-float
                          :initial-element 2.0d0))

(defun sum-array (array)
  (declare (type (simple-array double-float (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop for i fixnum below (the fixnum (length array)) do
          (setf sum (the double-float (+ sum (the double-float (aref array i))))))
    sum))

Note that I didn't use incf...

(compile *)

CL-USER 102 > (time (loop repeat 1000 do (sum-array *array*)))
Timing the evaluation of (loop repeat 1000 do (sum-array *array*))

user time    =      2.143
system time  =      0.000

So on my Windows PC the LWW version is 1.56 times faster than MSVC 7.0

(disassemble 'sum-array)
...
L2:   89:      DD45F8           fldl  [ebp-8]
      92:      89FB             move  ebx, edi
      94:      C1FB05           sar   ebx, 5
      97:      DD441814         fldl  [eax+14+ebx]
     101:      DEC1             faddp st(1), st
     103:      DD5DF8           fstpl [ebp-8]
     106:      81C700010000     add   edi, 100
     112:      3BFA             cmp   edi, edx
     114:      7DAB             jge   L1
     116:      EBE3             jmp   L2

I still find this not too bad. Of course, it would be better if the
conditional test jumped back to the loop start and if the sum stayed in a
register instead of being stored and reloaded on each iteration.

Now I switched back to single-float on both the C++ and Lisp sides, and here I
get almost the opposite:

MSVC7:
0.00228 seconds elapsed result: 884736.000000

LWW :
CL-USER 107 > (time (loop repeat 1000 do (sum-array *array*)))
Timing the evaluation of (loop repeat 1000 do (sum-array *array*))

user time    =      3.114
system time  =      0.000

So here the MSVC version is 1.4 times faster than the LWW one.

So on my PC (Pentium M 2.0GHz), LWW 4.4 is 1.56 times faster than MSVC7 on
double floats and 1.4 times slower on single floats.

Now if we want to play with the SSE2 instructions:
With double floats there is no speed improvement with MSVC7 (strange...)
With single floats, MSVC7 goes from 0.00334 s down to 0.00084 s.

Conclusion: Lisp is as fast as C++ for normal floating point code, but SSE2
instructions are really cool and you should ask your Lisp vendors to use
them in their compilers.

OK, now back to work...

Marc
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108513404.993236.146890@f14g2000cwb.googlegroups.com>
Marc Battyani wrote:
> <···········@mail.ru> wrote :
> > Marc Battyani wrote:
> > > Maybe because they already generate very good code ?
>
> [C++ code snipped]
>
> > May be "very good" is not enough ? OK. Let's evaluate performance:
> > 0.00096 seconds elapsed result: 884736.000000
> >
> > So C++ code is about 15 times faster than LispWorks version.
>
> It's not what I have.
> (As LWW works better on double-float, I switched to double floats for
C++
> and Lisp)
>
> As we are discussing windows, I tried your C++ version with MSVC 7.0
(cl /Ox
> test.cpp):
> 0.00334 seconds elapsed result: 884736.000000
>
> Now I tried with LWW 4.4 (on 1000 iterations as the LWW time function
is not
> very precise)
>
> (setf *array* (make-array 442368 :element-type 'double-float
> :initial-element 2.0d0) a 0)
>
> (defun sum-array (array)
>   (declare (type (simple-array double-float (*)) array)
>            (optimize (speed 3) (safety 0) (debug 0) (float 0)))
>   (let ((sum 0.0)) (declare (type double-float sum))
>      (loop for i fixnum below (the fixnum (length array)) do
>        (setf sum (the double-float (+ sum (the double-float(aref
array
> i))))))
>   sum))
>
> Note that I didn't use incf...
>
> (compile *)
>
> CL-USER 102 > (time (loop repeat 1000 do (sum-array *array*)))
> Timing the evaluation of (loop repeat 1000 do (sum-array *array*))
>
> user time    =      2.143
> system time  =      0.000

I took exactly the piece of code above.
My time on LWW 4.4 (Pentium 4-C 2.6GHz) is 4.843 seconds for the double-float
version and 4.796 for the single-float version.

And VC++ 6.0 takes 1.13 seconds for the double-float version and 0.9387 for
the single-float version.

Looks like your Pentium M 2.0GHz is faster than the Pentium 4-C 2.6GHz.

I don't know why LWW is about 4 times slower in this case. ;-(

C++ version was:
#include <windows.h>
#include <stdio.h>

double sum_elts(double *ar,int length)
{
    int i;
    double sum = 0.0;
    for (i=0;i<length;i++)
    {
          sum += ar[i];
    }
    return sum;
}

int main()
{
    int i,k;
    LARGE_INTEGER freq,st,en;
    double *ar = new double[442368];
    int length = 442368;
    for (i=0;i<length;i++)
    {
        ar[i]=2.0;
    }
    double res = 0.0;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    for (k=0;k<1000;k++)
    {
	res = sum_elts(ar,length);
    }
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) /
freq.QuadPart;
    printf("%2.5lf sec. elapsed res: %lf\n",time,res);

    delete [] ar;
    return 0;
}

Compiled with the /Ox option.
Replace all "double" occurrences with "float" to make the single-float
version.

> So on my Windows PC the LWW version is 1.56 times faster than MSVC 7.0
>
> (disassemble 'sum-array)
> ...
> L2:   89:      DD45F8           fldl  [ebp-8]
>       92:      89FB             move  ebx, edi
>       94:      C1FB05           sar   ebx, 5
>       97:      DD441814         fldl  [eax+14+ebx]
>      101:      DEC1             faddp st(1), st
>      103:      DD5DF8           fstpl [ebp-8]
>      106:      81C700010000     add   edi, 100
>      112:      3BFA             cmp   edi, edx
>      114:      7DAB             jge   L1
>      116:      EBE3             jmp   L2
>
> I still find this not too bad. Of course, it would be better if the
> conditional test jumped back to the loop start and if the sum stayed in a
> register instead of being stored and reloaded on each iteration.

I get exactly the same assembly code.

> Conclusion: Lisp is as fast as C++ for normal floating point code, but SSE2
> instructions are really cool and you should ask your Lisp vendors to use
> them in their compilers.

Conclusion: the Pentium 4-C is strongly biased towards fast execution of
C/C++ code ;-))

Thanks for nice review
Lisper
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <ubralx9rw.fsf@agharta.de>
On 15 Feb 2005 16:23:25 -0800, ···········@mail.ru wrote:

> Looks like your Pentium M 2.0GHz is faster than the Pentium 4-C 2.6GHz.

Most likely.  Intel's marketing department is trying desperately to
let us know that clock rates aren't important anymore.  Kind of hard
because they told us the exact opposite for years... :)

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Paul F. Dietz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <p5mdnS_geJCZC4nfRVn-vA@dls.net>
Edi Weitz wrote:

> Most likely.  Intel's marketing department is trying desperately to
> let us know that clock rates aren't important anymore.  Kind of hard
> because they told us the exact opposite for years... :)

Some of us didn't believe them. :)

	Paul
From: ···········@mail.ru
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <1108515786.779155.243190@l41g2000cwc.googlegroups.com>
Marc Battyani wrote:
> Conclusion: Lisp is as fast as C++ for normal floating point code, but SSE2
> instructions are really cool and you should ask your Lisp vendors to use
> them in their compilers.
>
> OK, now back to work...
>
> Marc

You were totally right.
I've just checked VC++ 7.1 and found that its code without SSE2 was slower
than the LispWorks version, and about 4 times faster with SSE2 turned on.

Is it hard for the vendors to add support for these instructions? It
would be cool to demonstrate Lisp code that is as fast as fully optimized
C/C++ code. My co-workers would change their minds about Lisp quickly ;-)

Regards
Lisper
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <X6pQd.44030$K54.14041@edtnps84>
Marc Battyani wrote:

> 
> That does not look very slow.
> OK, those declarations are not pretty, but you only need them in some parts.
> 

Well it does look slow to me.  With vector based CPU instructions that
routine could be reduced to one instruction.  On the vector based
machines I was familiar with (the old CYBER 205) there was a single
machine instruction to multiply a vector by a scalar.  I assume the
newer x86 instructions provide something similar.  Perhaps if
the Lisp vendors provided an API to embed machine instructions
directly and allow some low level access to get at the internal
representation of vectors, then this would all be academic.

Wade
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <CdpQd.44033$K54.3686@edtnps84>
Then the function could look something like

(defun multiply-array-by-scalar4 (array scalar)
   (declare (type (simple-array single-float (*)) array)
            (type single-float scalar))
   (embed-x86-assembler
    (x86:fsmultvs (load-register (address array)) (load-register scalar))))

(with the appropriate x86 (is it an SSE instruction?) CPU code.)

Wade
From: Gisle Sælensminde
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <slrnd14c1f.f41.gisle@kaktus.ii.uib.no>
On 2005-02-15, Wade Humeniuk <··················@telus.net> wrote:
> Marc Battyani wrote:
>
>> 
>> That does not look very slow.
>> OK, those declarations are not pretty, but you only need them in some parts.
>> 
>
> Well it does look slow to me.  With vector based CPU instructions that
> routine could be reduced to one instruction.  On the vector based
> machines I was familiar with (the old CYBER 205) there was a single
> machine instruction to multiply a vector by a scalar.  I assume the
> newer x86 instructions provide something similar.  Perhaps if
> the Lisp vendors provided an API to embed machine instructions
> directly and allow some low level access to get at the internal
> representation of vectors, then this would all be academic.

The OP compared to C++. As far as I know, few C++ compilers on x86 emit
MMX or SSE instructions, largely because these instructions are very hard to
use in a generic way. For that reason, plus backward compatibility
requirements, most executables on x86 run on the plain Pentium subset of
instructions. This seems to be changing with x86-64, which has much more
generic instructions.

My impression is the same as Marc's: Allegro with declarations is quite fast,
but not as fast as CMUCL/SBCL or C++ for that matter, but then I usually try
to write code fast rather than fast code.

-- 
Gisle Sælensminde, PhD student, scientific programmer
Computational biology unit, University of Bergen, Norway
Email: ·····@cbu.uib.no | Complicated is easy, simple is hard.
From: Carl Shapiro
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <ouyu0odttbc.fsf@panix3.panix.com>
Gisle Sælensminde <·····@kaktus.ii.uib.no> writes:

> The OP compared to C++. As far as I know, few C++ compilers on x86 emit
> MMX or SSE instructions, largely because these instructions are very hard
> to use in a generic way.
                        
This is simply untrue.  The Microsoft Visual C++ compiler will emit
SSE instructions as will the Intel C++ compiler.  When using those
compilers, you do not need to do anything funky to use the SSE unit.
Plus, both of these compilers support the Intel SSE intrinsics, so you
can do such things as ask for blocks of memory allocated along the
natural boundaries required for invoking SSE instructions at full
speed, as well as invoke specific SSE operations without dropping down
to assembler.
                        
> For that reason, plus backward compatibility requirements, most
> executables on x86 run on the plain Pentium subset of instructions.

Not true.  You can ask your compiler to emit separate function definitions
for whatever flavors of SSE and x87 you need.  The runtime linker will bind
the most specific definition for your processor at load time.
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <8FqQd.61930$L_3.27342@clgrps13>
Gisle Sælensminde wrote:

> The OP compared to C++. As far as I know, few C++ compilers on x86 emit
> MMX or SSE instructions, largely because these instructions are very hard
> to use in a generic way. For that reason, plus backward compatibility
> requirements, most executables on x86 run on the plain Pentium subset of
> instructions. This seems to be changing with x86-64, which has much more
> generic instructions.
> 

But if the OP is really interested in SPEED then he would go for
the MMX/SSE/SSE2 instructions.  Which means choosing a C++
compiler that supports them.

But like I said, if the Lisp vendors
had embedded-instruction capabilities then we programmers could decide
when to use them (in this case restricting one's app to
Pentium III+).  Then maybe the whining would die down to
a small whisper.  In the extreme, Cliki could have
libraries of all kinds of useful low-level numeric routines
for various CPUs.  Then only consenting adults need use them.

Wade
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <874qgd1y3w.fsf@p4.internal>
>>>>> "WH" == Wade Humeniuk <··················@telus.net> writes:
[...]
    WH> But like I said, if the Lisp vendors had embedded-instruction
    WH> capabilities then we programmers could decide when to use them
    WH> (in this case restricting one's app to Pentium III+).  [...]

I am pretty sure Corman does offer this.  A functionally similar
facility exists and is roughly documented for SBCL:

http://sourceforge.net/mailarchive/forum.php?thread_id=177673&forum_id=4133

linked from

http://sbcl-internals.cliki.net/VOP

cheers,

BM
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <AZrQd.61945$L_3.22135@clgrps13>
It appears that LW could potentially have it, though it is not documented.
See the PC86 package and SYSTEM:DEFASM, etc... (no
symbols are exported from the package).

In LW you can use my CAPI package-browser to peruse a package.

http://www3.telus.net/public/whumeniu/package-browser.lisp

Wade
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <10sQd.61946$L_3.12727@clgrps13>
Wade Humeniuk wrote:
> It appears that LW could potentially have it, though it is not documented.
> See the PC86 package and SYSTEM:DEFASM, etc... (no
> symbols are exported from the package).
> 

oops, that would be pc386 package.

Wade
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <uekfhzlwa.fsf@agharta.de>
On Tue, 15 Feb 2005 20:23:15 +0200, Bulent Murtezaoglu <··@acm.org> wrote:

> I am pretty sure Corman does offer this.

Yes, see chapter 9 of

  <http://www.cormanlisp.com/CormanLisp/CormanLisp_2_5.pdf>.

Of course this is easier for CCL than for other Lisps as most of the
other CL implementations (with the exception of MCL/OpenMCL) target
more than one processor family.

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Harald Hanche-Olsen
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <pcozmy5zhz9.fsf@shuttle.math.ntnu.no>
+ Wade Humeniuk <··················@telus.net>:

| Well it does look slow to me.  With vector based CPU instructions that
| routine could be reduced to one instruction.  On the vector based
| machines I was familiar with (the old CYBER 205) there was a single
| machine instruction to multiply a vector by a scalar.  I assume the
| newer x86 instructions provide something similar.  Perhaps if
| the Lisp vendors provided an API to embed machine instructions
| directly and allow some low level access to get at the internal
| representation of vectors, then this would all be academic.

Or one could venture into FFI land and use BLAS routines properly
optimized for the CPU in question?

  http://www.netlib.org/blas/faq.html
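
Just as a rough, untested sketch of what that can look like (the cblas_sscal
entry point is standard BLAS; the library name and the use of CFFI rather
than your implementation's own FFI are assumptions):

(cffi:define-foreign-library blas
  (:windows "libblas.dll")   ; assumed name; use ATLAS or a vendor BLAS as available
  (t (:default "libblas")))
(cffi:use-foreign-library blas)

;; void cblas_sscal(const int N, const float alpha, float *X, const int incX);
(cffi:defcfun ("cblas_sscal" %sscal) :void
  (n :int)
  (alpha :float)
  (x :pointer)
  (incx :int))

(defun scale-foreign-floats (n scalar)
  "Fill a foreign buffer of N single-floats with 2.0, scale it in place
with the BLAS SSCAL routine, and return the first element as a check."
  (cffi:with-foreign-object (buf :float n)
    (dotimes (i n)
      (setf (cffi:mem-aref buf :float i) 2.0))
    (%sscal n scalar buf 1)
    (cffi:mem-aref buf :float 0)))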

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
  than thinking does: but it deprives us of whatever chance there is
  of getting closer to the truth.  -- C.P. Snow
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <87zmy5zgdz.fsf@p4.internal>
>>>>> "HHO" == Harald Hanche-Olsen <······@math.ntnu.no> writes:
[...]
    HHO> Or one could venture into FFI land and use BLAS routines
    HHO> properly optimized for the CPU in question?

    HHO>   http://www.netlib.org/blas/faq.html

Or, even better, highly tuned ones:

http://math-atlas.sourceforge.net/

Matlisp (I believe) can use Atlas:

http://matlisp.sourceforge.net/

BM
From: rif
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <wj0vf8tlbk7.fsf@five-percent-nation.mit.edu>
Yes it can.  I do this all the time.
Of course, Matlisp only works on CMUCL or Allegro AFAIK.

rif
From: Raymond Toy
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <sxdwtt8ii35.fsf@rtp.ericsson.se>
>>>>> "rif" == rif  <···@mit.edu> writes:

    rif> Yes it can.  I do this all the time.
    rif> Of course, Matlisp only works on CMUCL or Allegro AFAIK.

I think sbcl should work as well.  

Ray
From: Camm Maguire
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <54650o5ihs.fsf@intech19.enhanced.com>
Greetings!  GCL is pretty competitive with CMUCL performance-wise in
my experience, and is supported on Windows.

Take care,

Wade Humeniuk <··················@telus.net> writes:

> Marc Battyani wrote:
> 
> > That does not look very slow.
> > OK, those declarations are not pretty, but you only need them in some parts.
> >
> 
> Well it does look slow to me.  With vector based CPU instructions that
> routine could be reduced to one instruction.  On the vector based
> machines I was familiar with (the old CYBER 205) there was a single
> machine instruction to multiply a vector by a scalar.  I assume the
> newer x86 instructions provide something similar.  Perhaps if
> the Lisp vendors provided an API to embed machine instructions
> directly and allow some low level access to get at the internal
> representation of vectors, then this would all be academic.
> 
> Wade
> 

-- 
Camm Maguire			     			····@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <878y5p28tq.fsf@p4.internal>
>>>>> "MB" == Marc Battyani <·············@fractalconcept.com> writes:
[...]
    MB> C++ is slow anyway. If you really want speed then program in
    MB> VHDL or Verilog.

Do you have a socket library to interface to programs written in those?

BM
From: Marc Battyani
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date: 
Message-ID: <cut2di$ukb@library2.airnews.net>
"Bulent Murtezaoglu" <··@acm.org> wrote
> >>>>> "MB" == Marc Battyani <·············@fractalconcept.com> writes:
> [...]
>     MB> C++ is slow anyway. If you really want speed then program in
>     MB> VHDL or Verilog.
>
> Do you have a socket library to interface to programs written in those?

Yes, I'm currently designing a small board (with an FPGA, some Flash and
SDRAM memory, and an Ethernet 100 PHY) where I do exactly this. The fast
computation is done in VHDL and the higher-level stuff and socket
communication is done in C (I'm even looking for a small Lisp, see my posts
here from last week) on a soft processor embedded in the FPGA running
uClinux. That's very cool ;-)

Here is an example of this kind of system with uClinux in a FPGA:
http://www.altera.com/products/devkits/altera/kit-nios_eval_1C12.html

BTW I'm still looking for a Common Lisp to run on this. So far the best
solutions I have found are OpenLisp (a nice ISLisp implementation, but not
multi-threaded) and Chicken (a Scheme).

Marc