Hi,
Since I found Common Lisp I have been learning it from time to time. However, I have to use C/C++ on a daily basis. I started to promote Common Lisp and didn't encounter any resistance from my boss. You can imagine what difficulties I met using Common Lisp for the first application (they were not about Lisp itself but rather about C/C++ habits).
Today I know much more than I did years ago. This year my co-workers started to escape from C++ to C# and are pretty happy with it. Since the .NET runtime is no smaller than any Lisp image, these days I don't think much about the size of my applications. What remains is SPEED.
The question is this: I saw that CMUCL's performance on image processing can be equal to optimized C code, but there is no such thing on Windows. Why? Allegro CL, LispWorks, and Corman Lisp are all too slow in comparison with optimized C/C++ code. I've heard some arguments about it (for example, messages from Duane Rettig about code optimization on modern CPUs), but I still wonder why the commercial vendors (even Franz) cannot improve their code generation to match the performance of their competitors. I would rewrite my C/C++ DLLs in Common Lisp and wouldn't use the FFI so much. I believe it is possible (some day).
Regards
Lisper
<···········@mail.ru> wrote
> thing on Windows. Why ? Allegro CL, LispWorks, Corman Lisp - are too
> slow in comparison with optimized C/C++ code. I've heard some arguments
> about it (for example messages from Duane Rettig about code
> optimization on modern CPUs) , but still wonder why can not commercial
> vendors (even Franz) improve code generation to match performance with
> competitors?
Maybe because they already generate very good code?
This topic is a recurrent one here, so search with Google and you will find lots of examples.
Here is one from a recent discussion on the LispWorks mailing list:
(defun multiply-array-by-scalar4 (array scalar)
  (declare (type (simple-array single-float (*)) array)
           (type single-float scalar)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (loop for i fixnum below (the fixnum (length array)) do
        (setf (aref array i) (* scalar (aref array i)))))
LWW:
0: 55 push ebp
1: 89E5 move ebp, esp
3: 83EC18 sub esp, 18
6: C7042445180000 move [esp], 1845
13: 8B5D08 move ebx, [ebp+8]
16: DD4004 fldl [eax+4]
19: DD5DEC fstpl [ebp-14]
22: 8B430C move eax, [ebx+C]
25: 33FF xor edi, edi
27: 3BF8 cmp edi, eax
29: 7C0A jl L2
L1: 31: B810000000 move eax, 10
36: FD std
37: C9 leave
38: C20400 ret 4
L2: 41: 89FA move edx, edi
43: C1FA06 sar edx, 6
46: D9441310 flds [ebx+10+edx]
50: DC4DEC fmull [ebp-14]
53: 89FA move edx, edi
55: C1FA06 sar edx, 6
58: D95C1310 fstps [ebx+10+edx]
62: 81C700010000 add edi, 100
68: 3BF8 cmp edi, eax
70: 7DD7 jge L1
72: EBDF jmp L2
ACL 7.0:
0: d9 42 f2 fldf [edx-14]
3: dd da fstp st(2)
5: 8b 58 f2 movl ebx,[eax-14]
8: 33 d2 xorl edx,edx
10: eb 1b jmp 39
12: d9 44 10 f6 fldf [eax+edx-10]
16: dd db fstp st(3)
18: d9 af 4b fd fldcwf [edi-693] ; SYS::SINGLE_CONVERTER
ff ff
24: d9 c2 fld st,st(2)
26: d8 ca fmul st,st(2)
28: dd db fstp st(3)
30: d9 c2 fld st,st(2)
32: d9 5c 10 f6 fstpf [eax+edx-10]
36: 83 c2 04 addl edx,$4
39: 3b d3 cmpl edx,ebx
41: 7c e1 jl 12
43: 8b c7 movl eax,edi
45: f8 clc
46: 8b 75 fc movl esi,[ebp-4]
49: c3 ret
Does not look very slow.
OK, those declarations are not pretty, but you only need them in some parts.
> I would rewrite my C/C++ DLLs in Common Lisp and wouldn't
> use FFI so much. I belive it is possible (some day).
C++ is slow anyway. If you really want speed then program in VHDL or
Verilog.
Marc
Marc Battyani wrote:
> <···········@mail.ru> wrote
> > thing on Windows. Why ? Allegro CL, LispWorks, Corman Lisp - are too
> > slow in comparison with optimized C/C++ code. I've heard some arguments
> > about it (for example messages from Duane Rettig about code
> > optimization on modern CPUs) , but still wonder why can not commercial
> > vendors (even Franz) improve code generation to match performance with
> > competitors?
>
> Maybe because they already generate very good code ?
> This topic is a recurrent one here so search with google you will
> find lots of examples.
Maybe "very good" is not enough? OK, let's evaluate performance:
(setf *array* (make-array 442368 :element-type 'single-float
                          :initial-element 2.0))

(defun sum-array (array)
  (declare (type (simple-array single-float (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((sum 0.0))
    (declare (type single-float sum))
    (loop for i fixnum below (the fixnum (length array)) do
          (incf sum (the single-float (aref array i))))
    sum))
(time (sum-array *array*))
**********************************************
Timing the evaluation of (SUM-ARRAY *ARRAY*)
user time = 0.015
system time = 0.000
Elapsed time = 0:00:00
Allocation = 16 bytes standard / 0 bytes conses
0 Page faults
884736.0
**********************************************
Now C/C++:
#include <windows.h>
#include <cstdio>

float sum_elts(float *ar, int length)
{
    int i;
    float sum = 0.0f;
    for (i = 0; i < length; i++)
    {
        sum += ar[i];
    }
    return sum;
}

int main()
{
    int i;
    LARGE_INTEGER freq, st, en;
    float *ar = new float[442368];
    int length = 442368;
    for (i = 0; i < length; i++)
    {
        ar[i] = 2.0f;
    }
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    float sum = sum_elts(ar, length);
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) / freq.QuadPart;
    printf("%2.5lf seconds elapsed result: %lf\n", time, sum);
    delete [] ar;
    return 0;
}
0.00096 seconds elapsed result: 884736.000000
So the C++ code is about 15 times faster than the LispWorks version.
Now CMUCL:
* (time (sum-array *array*))
; [GC threshold exceeded with 30,693,280 bytes in use.  Commencing GC.]
; [GC completed with 19,894,104 bytes retained and 10,799,176 bytes freed.]
; [GC will next occur when at least 31,894,104 bytes are in use.]
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 0.01 seconds of real time
; 0.002 seconds of user run time
; 0.0 seconds of system run time
; 5,719,264 CPU cycles
; 0 page faults and
; 8 bytes consed.
;
884736.0
The CMUCL code is as fast as the C++ version. So, what am I doing wrong?
I had already tested the same code on Allegro CL 7.0 and the results were no better than on LispWorks (LispWorks was even slightly faster than ACL).
> Does not look very slow.
> OK those declaration are not pretty but you only need them in some parts.
I know and it is OK.
> C++ is slow anyway. If you really want speed then program in VHDL or
> Verilog.
I want to program solely in Common Lisp, and I see that CMUCL can be as fast as C++ or even faster. But unfortunately I need to program on MS Windows ;-(
Regards
Lisper
; Evaluation took:
; 0.01 seconds of real time
; 0.002 seconds of user run time
I was wrong: in this case CMUCL was 10 times slower than the C++ version.
;-(
But as I checked before, it showed performance equal to C/C++ on integer arithmetic.
I'll check it again...
Regards
Lisper
···········@mail.ru wrote:
> ; Evaluation took:
> ; 0.01 seconds of real time
> ; 0.002 seconds of user run time
>
> I was wrong, in this case CMUCL was 10 times slower than C++ version.
> ;-(
> But As I checked before it show performance equal to C/C++ on integer
> arithmetic.
>
> I'll check it again....
>
> Regards
> Lisper
I was wrongly looking at the first number in the CMUCL timing. It looks like CMUCL was only 2 times slower on single-float summing.
* (set-array)
NIL
* (time (sum-array *array*))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 0.00 seconds of real time
; 0.002 seconds of user run time
; 0.0 seconds of system run time
; 5,448,704 CPU cycles
; 0 page faults and
; 8 bytes consed.
;
884736.0
Let's compare performance on integers:
(defvar *array* (make-array 442368 :element-type 'fixnum
                            :initial-element 2))

(defun sum-array (array)
  (declare (type (simple-array fixnum (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((isum 0))
    (declare (type fixnum isum))
    (loop for k fixnum below 10 do
          (progn
            (setf isum 0)
            (loop for i fixnum below (the fixnum (length array)) do
                  (incf isum (the fixnum (aref array i))))))
    isum))
We will sum elements of *array* ten times.
LispWorks:
Timing the evaluation of (SUM-ARRAY *ARRAY*)
user time = 0.015
system time = 0.000
Elapsed time = 0:00:00
Allocation = 0 bytes standard / 0 bytes conses
0 Page faults
884736
CMUCL:
* (time (sum-array *array*))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 0.01 seconds of real time
; 0.007999 seconds of user run time
; 0.0 seconds of system run time
; 20,653,528 CPU cycles
; 0 page faults and
; 0 bytes consed.
;
884736
C++ version:
#include <windows.h>
#include <cstdio>

int sum_elts(int *ar, int length)
{
    int i, k;
    int sum = 0;
    for (k = 0; k < 10; k++)
    {
        sum = 0;
        for (i = 0; i < length; i++)
        {
            sum += ar[i];
        }
    }
    return sum;
}

int main()
{
    int i;
    LARGE_INTEGER freq, st, en;
    int *ar = new int[442368];
    int length = 442368;
    for (i = 0; i < length; i++)
    {
        ar[i] = 2;
    }
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    int res = sum_elts(ar, length);
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) / freq.QuadPart;
    printf("%2.5lf sec. elapsed res: %d\n", time, res);
    delete [] ar;
    return 0;
}
0.00621 sec. elapsed res: 884736
CMUCL was 1.28 times slower than C++ (the difference is so small that it could be timing noise).
LispWorks was 2.42 times slower than C++ (that is definitely not noise).
I was advised to loop this test 100 times or more to exclude timing noise; I'll post those results later.
Regards
Lisper
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <ull9py30e.fsf@agharta.de>
On 15 Feb 2005 12:22:54 -0800, ···········@mail.ru wrote:
> LispWorks:
> Timing the evaluation of (SUM-ARRAY *ARRAY*)
>
> user time = 0.015
> system time = 0.000
> Elapsed time = 0:00:00
> Allocation = 0 bytes standard / 0 bytes conses
> 0 Page faults
Try adding the declaration
(hcl:fixnum-safety 0)
to your code.
You might also want to have a look at
<http://www.lispworks.com/documentation/lw44/LWUG/html/lwuser-92.htm>
and in fact the whole chapter 9 of the LispWorks User Guide.
Cheers,
Edi.
Thank you! This is exactly what I needed to read about LispWorks code optimization.
From: Brian Downing
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <BlsQd.5773$4q6.1827@attbi_s01>
In article <························@g14g2000cwa.googlegroups.com>,
<···········@mail.ru> wrote:
> ; Evaluation took:
> ; 0.01 seconds of real time
> ; 0.002 seconds of user run time
>
> I was wrong, in this case CMUCL was 10 times slower than C++ version.
> ;-(
> But As I checked before it show performance equal to C/C++ on integer
> arithmetic.
>
> I'll check it again....
Your times are so small that they're way down in the noise for most
operating systems' timers. Try looping your benchmark a couple thousand
times and compare that instead!
(Incidentally I just did your example (x1000) here on CMUCL 18e and g++
3.1 (-O3), and both perform about the same.)
-bcd
--
*** Brian Downing <bdowning at lavos dot net>
Brian Downing wrote:
> In article <························@g14g2000cwa.googlegroups.com>,
> <···········@mail.ru> wrote:
> > ; Evaluation took:
> > ; 0.01 seconds of real time
> > ; 0.002 seconds of user run time
> >
> > I was wrong, in this case CMUCL was 10 times slower than C++ version.
> > ;-(
> > But As I checked before it show performance equal to C/C++ on integer
> > arithmetic.
> >
> > I'll check it again....
>
> Your times are so small that they're way down in the noise for most
> operating systems' timers. Try looping your benchmark a couple
> thousand times and compare that instead!
>
> (Incidentally I just did your example (x1000) here on CMUCL 18e and
> g++ 3.1 (-O3), and both perform about the same.)
>
> -bcd
> --
> *** Brian Downing <bdowning at lavos dot net>
You are right. I ran the integer summing (x10000) and found that LispWorks was only 1.3 times slower than the same C++ version. On summing corresponding elements of two arrays and writing the result to the second array, it shows a time equal to CMUCL and C++, and CMUCL shows even slightly shorter times on both tasks than C++.
But on single-float summing (x10000 too) LispWorks is still at least ~5.3 times slower than C++ ;-( and CMUCL is still 2 times slower (which I think is good).
I'd say LispWorks turns out to be fast enough to rewrite some image-processing C++ functions in it. We use a lot of integer arithmetic and much less floating point.
Sorry to bother you with the topic. I showed the same test to the Xanalys team (that code was 1.9 times slower than the optimized C version) and they couldn't make it faster. Our tasks do not allow us to be more than 1.3 times slower than optimized C; otherwise we will not fit in the 40 ms interval between video frames.
Thanks to all
Lisper
<···········@mail.ru> wrote :
> Marc Battyani wrote:
> > Maybe because they already generate very good code ?
[C++ code snipped]
> May be "very good" is not enough ? OK. Let's evaluate performance:
> 0.00096 seconds elapsed result: 884736.000000
>
> So C++ code is about 15 times faster than LispWorks version.
That's not what I get.
(As LWW works better on double-floats, I switched to double floats for both the C++ and the Lisp.)
As we are discussing windows, I tried your C++ version with MSVC 7.0 (cl /Ox
test.cpp):
0.00334 seconds elapsed result: 884736.000000
Now I tried with LWW 4.4 (on 1000 iterations, as the LWW time function is not very precise):

(setf *array* (make-array 442368 :element-type 'double-float
                          :initial-element 2.0d0))
(defun sum-array (array)
  (declare (type (simple-array double-float (*)) array)
           (optimize (speed 3) (safety 0) (debug 0) (float 0)))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop for i fixnum below (the fixnum (length array)) do
          (setf sum (the double-float
                         (+ sum (the double-float (aref array i))))))
    sum))
Note that I didn't use incf...
(compile *)
CL-USER 102 > (time (loop repeat 1000 do (sum-array *array*)))
Timing the evaluation of (loop repeat 1000 do (sum-array *array*))
user time = 2.143
system time = 0.000
So on my Windows PC the LWW version is 1.56 times faster than MSVC 7.0
(disassemble 'sum-array)
...
L2: 89: DD45F8 fldl [ebp-8]
92: 89FB move ebx, edi
94: C1FB05 sar ebx, 5
97: DD441814 fldl [eax+14+ebx]
101: DEC1 faddp st(1), st
103: DD5DF8 fstpl [ebp-8]
106: 81C700010000 add edi, 100
112: 3BFA cmp edi, edx
114: 7DAB jge L1
116: EBE3 jmp L2
I still find this not too bad. Of course, it would be better to make the conditional test go back to the loop start and avoid storing/reloading the sum.
Now I switched back to single-floats in both the C++ and the Lisp, and here I get almost the opposite:
MSVC7:
0.00228 seconds elapsed result: 884736.000000
LWW:
CL-USER 107 > (time (loop repeat 1000 do (sum-array *array*)))
Timing the evaluation of (loop repeat 1000 do (sum-array *array*))
user time = 3.114
system time = 0.000
So here the MSVC version is 1.4 times faster than the LWW.
So on my PC (Pentium M 2.0GHz), LWW4.4 is 1.56 times faster on double
floats and 1.4 times slower on single floats than MSVC7.
Now, if we want to play with the SSE2 instructions:
With double floats there is no speed improvement with MSVC7 (strange...).
With single floats, MSVC7 goes from 0.00334 s down to 0.00084 s.
Conclusion: Lisp is as fast as C++ for normal floating-point code, but SSE2 instructions are really cool and you should ask your Lisp vendors to use them in their compilers.
OK, now back to work...
Marc
Marc Battyani wrote:
> <···········@mail.ru> wrote :
> > Marc Battyani wrote:
> > > Maybe because they already generate very good code ?
>
> [C++ code snipped]
>
> > May be "very good" is not enough ? OK. Let's evaluate performance:
> > 0.00096 seconds elapsed result: 884736.000000
> >
> > So C++ code is about 15 times faster than LispWorks version.
>
> It's not what I have.
> (As LWW works better on double-float, I switched to double floats for
> C++ and Lisp)
>
> As we are discussing windows, I tried your C++ version with MSVC 7.0
> (cl /Ox test.cpp):
> 0.00334 seconds elapsed result: 884736.000000
>
> Now I tried with LWW 4.4 (on 1000 iterations as the LWW time function
> is not very precise)
>
> (setf *array* (make-array 442368 :element-type 'double-float
> :initial-element 2.0d0) a 0)
>
> (defun sum-array (array)
>   (declare (type (simple-array double-float (*)) array)
>            (optimize (speed 3) (safety 0) (debug 0) (float 0)))
>   (let ((sum 0.0)) (declare (type double-float sum))
>     (loop for i fixnum below (the fixnum (length array)) do
>           (setf sum (the double-float (+ sum (the double-float (aref array i))))))
>     sum))
>
> Note that I didn't use incf...
>
> (compile *)
>
> CL-USER 102 > (time (loop repeat 1000 do (sum-array *array*)))
> Timing the evaluation of (loop repeat 1000 do (sum-array *array*))
>
> user time = 2.143
> system time = 0.000
I took exactly the piece of code above.
My times on LWW 4.4 (Pentium 4-C 2.6 GHz) are 4.843 s for the double-float version and 4.796 s for the single-float version.
VC++ 6.0 takes 1.13 s for the double-float version and 0.9387 s for the single-float version.
It looks like your Pentium M 2.0 GHz is faster than a Pentium 4-C 2.6 GHz.
I don't know why LWW is about 4 times slower in this case. ;-(
The C++ version was:

#include <windows.h>
#include <cstdio>

double sum_elts(double *ar, int length)
{
    int i;
    double sum = 0.0;
    for (i = 0; i < length; i++)
    {
        sum += ar[i];
    }
    return sum;
}

int main()
{
    int i, k;
    LARGE_INTEGER freq, st, en;
    double *ar = new double[442368];
    int length = 442368;
    for (i = 0; i < length; i++)
    {
        ar[i] = 2.0;
    }
    double res = 0.0;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&st);
    for (k = 0; k < 1000; k++)
    {
        res = sum_elts(ar, length);
    }
    QueryPerformanceCounter(&en);
    double time = ((double)(en.QuadPart - st.QuadPart)) / freq.QuadPart;
    printf("%2.5lf sec. elapsed res: %lf\n", time, res);
    delete [] ar;
    return 0;
}
Compiled with the /Ox option.
Replace all "double" occurrences with "float" to make the single-float version.
> So on my Windows PC the LWW version is 1.56 times faster than MSVC 7.0
>
> (disassemble 'sum-array)
> ...
> L2: 89: DD45F8 fldl [ebp-8]
> 92: 89FB move ebx, edi
> 94: C1FB05 sar ebx, 5
> 97: DD441814 fldl [eax+14+ebx]
> 101: DEC1 faddp st(1), st
> 103: DD5DF8 fstpl [ebp-8]
> 106: 81C700010000 add edi, 100
> 112: 3BFA cmp edi, edx
> 114: 7DAB jge L1
> 116: EBE3 jmp L2
>
> I still find this not too bad. Of course, it would be better to make
> the conditional test go back to the loop start and avoid to
> store/reload the sum.
I get exactly the same assembly code.
> Conclusion: Lisp is as fast as C++ for normal floating point code but
> SSE2 instructions are really cool and you should ask your Lisp vendors
> to use them in their compilers.
Conclusion: the Pentium 4-C is strongly biased towards fast execution of C/C++ code ;-))
Thanks for the nice review
Lisper
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <ubralx9rw.fsf@agharta.de>
On 15 Feb 2005 16:23:25 -0800, ···········@mail.ru wrote:
> Looks like your Pentium M 2.0GHz is faster than Pentium 4-C 2.6Ghz.
Most likely. Intel's marketing department is trying desperately to
let us know that clock rates aren't important anymore. Kind of hard
because they told us the exact opposite for years... :)
Cheers,
Edi.
--
Lisp is not dead, it just smells funny.
Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Paul F. Dietz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <p5mdnS_geJCZC4nfRVn-vA@dls.net>
Edi Weitz wrote:
> Most likely. Intel's marketing department is trying desperately to
> let us know that clock rates aren't important anymore. Kind of hard
> because they told us the exact opposite for years... :)
Some of us didn't believe them. :)
Paul
Marc Battyani wrote:
> Conclusion: Lisp is as fast as C++ for normal floating point code but
> SSE2 instructions are really cool and you should ask your Lisp vendors
> to use them in their compilers.
>
> OK, now back to work...
>
> Marc
You were totally right.
I've just checked VC++ 7.1 and found that its code without SSE2 was slower than the LispWorks version, and about 4 times faster with SSE2 turned on.
Is it hard for the vendors to support these instructions? It would be cool to demonstrate Lisp code that is as fast as fully optimized C/C++ code. My co-workers would change their minds about Lisp fast ;-)
Regards
Lisper
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <X6pQd.44030$K54.14041@edtnps84>
Marc Battyani wrote:
>
> Does not look very slow.
> OK those declaration are not pretty but you only need them in some parts.
>
Well, it does look slow to me. With vector-based CPU instructions that
routine could be reduced to one instruction. On the vector-based
machines I was familiar with (the old CYBER 205) there was a single
machine instruction to multiply a vector by a scalar. I assume the
newer x86 instructions provide something similar. Perhaps if the Lisp
vendors provided an API to embed machine instructions directly and
allowed some low-level access to the internal representation of
vectors, then this would all be academic.
Wade
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <CdpQd.44033$K54.3686@edtnps84>
Then the function could look something like
(defun multiply-array-by-scalar4 (array scalar)
  (declare (type (simple-array single-float (*)) array)
           (type single-float scalar))
  (embed-x86-assembler
   (x86:fsmultvs (load-register (address array)) (load-register scalar))))
(with the appropriate x86 (is it a SSE instruction?) CPU code.)
Wade
On 2005-02-15, Wade Humeniuk <··················@telus.net> wrote:
> Marc Battyani wrote:
>
>>
>> Does not look very slow.
>> OK those declaration are not pretty but you only need them in some parts.
>>
>
> Well it does look slow to me. With vector based CPU instructions that
> routine could be reduced to one instruction. On the vector based
> machines I was familiar with (the old CYBER 205) there was a single
> machine instruction to multiply a vector by a scalar. I assume the
> newer x86 instructions provide something similar. Perhaps if
> the Lisp vendors provided an API to embed machine instructions
> directly and allow some low level access to get at the internal
> representation of vectors, then this would all be academic.
The OP compared to C++. As far as I know, few C++ compilers on x86 emit MMX or SSE instructions, largely because these instructions are very hard to use in a generic way. For that reason, plus backward-compatibility requirements, most executables on x86 run on the plain Pentium subset of instructions. This seems to change with x86-64, which has much more generic instructions.
My impression is the same as Marc's: Allegro with declarations is quite fast, but not as fast as CMUCL/SBCL or C++ for that matter; but then I usually try to write code fast rather than write fast code.
--
Gisle Sælensminde, PhD student, Scientific programmer
Computational biology unit, University of Bergen, Norway
Email: ·····@cbu.uib.no | Complicated is easy, simple is hard.
Gisle Sælensminde <·····@kaktus.ii.uib.no> writes:
> The OP compared to C++. As far as I know, few C++ compilers on x86 emits
> MMX or SSE instructions, much because these instructions is very hard to
> use in a generic way.
This is simply untrue. The Microsoft Visual C++ compiler will emit SSE
instructions, as will the Intel C++ compiler. When using those
compilers, you do not need to do anything funky to use the SSE unit.
Plus, both of these compilers support the Intel SSE intrinsics, so you
can do such things as ask for blocks of memory allocated along the
natural boundaries required for invoking SSE instructions at full
speed, as well as invoke specific SSE operations without dropping down
to assembler.
> For that reason plus backward compatibility requirements, most
> executables on x86 runs on the plain pentium subset of instructions.
Not true. You can ask your compiler for separate function definitions
for whatever flavors of SSE and x87. The runtime linker will bind the
most specific definitions for your processor at load time.
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <8FqQd.61930$L_3.27342@clgrps13>
Gisle Sælensminde wrote:
> The OP compared to C++. As far as I know, few C++ compilers on x86 emits
> MMX or SSE instructions, much because these instructions is very hard to
> use in a generic way. For that reason plus backward compatibility requirements,
> most executables on x86 runs on the plain pentium subset of instructions. This seems
> to change with the x86-64, that has much more generic instructions.
>
But if the OP is really interested in SPEED, then he would go for the
MMX/SSE/SSE2 instructions, which means choosing a C++ compiler that
supports them.
But like I said, if the Lisp vendors had embedded-instruction
capabilities then we programmers could decide when to use them (in
this case, restrict one's app to Pentium III+). Then maybe the whining
would die down to a small whisper. In the extreme, CLiki could have
libraries of all kinds of useful low-level numeric routines for
various CPUs. Then only consenting adults need use them.
Wade
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <874qgd1y3w.fsf@p4.internal>
>>>>> "WH" == Wade Humeniuk <··················@telus.net> writes:
[...]
WH> But like I said if the Lisp vendors had embedded-instruction
WH> capabilities then us programmers could decide when to use it
WH> (in this case restrict one's app to Pentium III+). [...]
I am pretty sure Corman does offer this. A functionally similar
facility exists and is roughly documented for SBCL:
http://sourceforge.net/mailarchive/forum.php?thread_id=177673&forum_id=4133
linked from
http://sbcl-internals.cliki.net/VOP
cheers,
BM
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <AZrQd.61945$L_3.22135@clgrps13>
It appears that LW could potentially have it, though it is not documented.
See the PC86 package and SYSTEM:DEFASM, etc... (no
symbols are exported from the package).
In LW you can use my CAPI package-browser to peruse a package.
http://www3.telus.net/public/whumeniu/package-browser.lisp
Wade
From: Wade Humeniuk
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <10sQd.61946$L_3.12727@clgrps13>
Wade Humeniuk wrote:
> It appears that LW could potentially have it, though it is not documented.
> See the PC86 package and SYSTEM:DEFASM, etc... (no
> symbols are exported from the package).
>
Oops, that would be the PC386 package.
Wade
From: Edi Weitz
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <uekfhzlwa.fsf@agharta.de>
On Tue, 15 Feb 2005 20:23:15 +0200, Bulent Murtezaoglu <··@acm.org> wrote:
> I am pretty sure Corman does offer this.
Yes, see chapter 9 of
<http://www.cormanlisp.com/CormanLisp/CormanLisp_2_5.pdf>.
Of course this is easier for CCL than for other Lisps, as most of the
other CL implementations (with the exception of MCL/OpenMCL) target
more than one processor family.
Cheers,
Edi.
--
Lisp is not dead, it just smells funny.
Real email: (replace (subseq ·········@agharta.de" 5) "edi")
+ Wade Humeniuk <··················@telus.net>:
| Well it does look slow to me. With vector based CPU instructions that
| routine could be reduced to one instruction. On the vector based
| machines I was familiar with (the old CYBER 205) there was a single
| machine instruction to multiply a vector by a scalar. I assume the
| newer x86 instructions provide something similar. Perhaps if
| the Lisp vendors provided an API to embed machine instructions
| directly and allow some low level access to get at the internal
| representation of vectors, then this would all be academic.
Or one could venture into FFI land and use BLAS routines properly
optimized for the CPU in question?
http://www.netlib.org/blas/faq.html
--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- Debating gives most of us much more psychological satisfaction
than thinking does: but it deprives us of whatever chance there is
of getting closer to the truth. -- C.P. Snow
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <87zmy5zgdz.fsf@p4.internal>
>>>>> "HHO" == Harald Hanche-Olsen <······@math.ntnu.no> writes:
[...]
HHO> Or one could venture into FFI land and use BLAS routines
HHO> properly optimized for the CPU in question?
HHO> http://www.netlib.org/blas/faq.html
Or, even better, highly tuned ones:
http://math-atlas.sourceforge.net/
Matlisp (I believe) can use Atlas:
http://matlisp.sourceforge.net/
BM
>>>>> "rif" == rif <···@mit.edu> writes:
rif> Yes it can. I do this all the time.
rif> Of course, Matlisp only works on CMUCL or Allegro AFAIK.
I think sbcl should work as well.
Ray
Greetings! GCL is pretty competitive with CMUCL performance-wise in
my experience, and is supported on Windows.
Take care,
Wade Humeniuk <··················@telus.net> writes:
> Marc Battyani wrote:
>
> > Does not look very slow.
> > OK those declaration are not pretty but you only need them in some parts.
> >
>
> Well it does look slow to me. With vector based CPU instructions that
> routine could be reduced to one instruction. On the vector based
> machines I was familiar with (the old CYBER 205) there was a single
> machine instruction to multiply a vector by a scalar. I assume the
> newer x86 instructions provide something similar. Perhaps if
> the Lisp vendors provided an API to embed machine instructions
> directly and allow some low level access to get at the internal
> representation of vectors, then this would all be academic.
>
> Wade
>
--
Camm Maguire ····@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah
From: Bulent Murtezaoglu
Subject: Re: Can Windows CL compilers be as fast as CMUCL ?
Date:
Message-ID: <878y5p28tq.fsf@p4.internal>
>>>>> "MB" == Marc Battyani <·············@fractalconcept.com> writes:
[...]
MB> C++ is slow anyway. If you really want speed then program in
MB> VHDL or Verilog.
Do you have a socket library to interface to programs written in those?
BM
"Bulent Murtezaoglu" <··@acm.org> wrote
> >>>>> "MB" == Marc Battyani <·············@fractalconcept.com> writes:
> [...]
> MB> C++ is slow anyway. If you really want speed then program in
> MB> VHDL or Verilog.
>
> Do you have a socket library to interface to programs written in those?
Yes, I'm currently designing a small board (with an FPGA, some Flash and
SDRAM memory, and an Ethernet 100 PHY) where I do exactly this. The fast
computation is done in VHDL, and the higher-level stuff and socket
communication is done in C (I'm even looking for a small Lisp; see my
posts here from last week) on a soft processor embedded in the FPGA and
running uClinux. That's very cool ;-)
Here is an example of this kind of system with uClinux in a FPGA:
http://www.altera.com/products/devkits/altera/kit-nios_eval_1C12.html
BTW I'm still looking for a Common Lisp to run on this. So far the best
solutions I have found are OpenLisp (a nice ISLisp implementation, but not
multi-threaded) and Chicken (a Scheme).
Marc