From: alex goldman
Subject: Lisp on AMD 64
Date: 
Message-ID: <2823248.LMPUzLNHHo@yahoo.com>
How much faster is the compiled code of CMUCL/SBCL on AMD 64 compared to
Athlon XP? When running in 'compatibility mode', is AMD 64 essentially just
like Athlon XP (with the same 'effective clockspeed'), but twice the L1
cache and CPU-to-RAM communication speed?

From: Ulrich Hobelmann
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <3dad1uF6pt8r9U3@individual.net>
alex goldman wrote:
> On the same subject, how does AMD 64 3000+ get its "MHz number" ? Do they
> compare it to a 3 GHz 32-bit Pentium 4 in compatibility mode?

The MHz number is its actual MHz.  The "3000+" means that an 
Athlon (not the XP, but the old Athlon "Thunderbird" that went up 
to 1400MHz) at 3000MHz would run roughly that fast.

The newer models (XP) mostly have larger caches, maybe they 
changed some internals as well.

-- 
No man is good enough to govern another man without that other's 
consent. -- Abraham Lincoln
From: Rob Warnock
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <2ZednWSMbaMAA-3fRVn-gg@speakeasy.net>
Juho Snellman  <······@iki.fi> wrote:
+---------------
| Other than that, the main reasons for using a 64-bit SBCL right now are:
|   * You have an Athlon 64, your Linux distribution is pure 64-bit without
|     a 32-bit compatibility environment, and you don't want to run a 32-bit
|     chroot.
+---------------

Say what? Which x86_64 Linux *doesn't* have a 32-bit compatibility
environment installed by default?!?

Anyway, FWIW, my workstation at work is an Athlon64 3600+ (?)
with a 1.8 GHz clock (I think, can't see the label at the moment
but "dmesg" says "3591.37 BogoMIPS" and "Detected 1802.357 MHz
TSC timer") running Linux 2.4.21-20.EL (some RedHat distro),
and it runs the pre-compiled 32-bit CMUCL-19a binary just fine:

    > (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   2.22f0 seconds of real time
    ;   2.22f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   4,011,445,780 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    > 

Compare this with the same CMUCL-19a on a 32-bit Athlon laptop
running FreeBSD [which says "mobile AMD Athlon(tm) XP2500+
(1855.20-MHz 686-class CPU)" and "Timecounter TSC frequency
1855201003 Hz"]:

    > (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   2.18f0 seconds of real time
    ;   2.160405f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   4,039,789,390 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    > 

Virtually identical, yes?
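
[Editor's sketch, not part of Rob's post.] Dividing the reported cycle
counts by the 2,000,000,000 iterations gives roughly two cycles per
iteration on both machines, which is the sense in which the two runs are
"virtually identical". A minimal C sketch of that arithmetic, using only
the figures from the two transcripts; the function and constant names are
made up for illustration:

```c
#include <assert.h>

/* Figures copied from the two TIME transcripts above. */
const double ITERATIONS       = 2000000000.0; /* (dotimes (i 2000000000)) */
const double ATHLON64_CYCLES  = 4011445780.0; /* Athlon64, Linux          */
const double ATHLON_XP_CYCLES = 4039789390.0; /* Athlon XP, FreeBSD       */

/* Average CPU cycles per loop iteration. */
double cycles_per_iteration(double cpu_cycles, double iterations)
{
    assert(iterations > 0.0);
    return cpu_cycles / iterations;
}
```

Called with the figures above, cycles_per_iteration() comes out at about
2.006 for the Athlon64 and about 2.020 for the Athlon XP.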

So, yes, while a 64-bit CMUCL would be interesting, I'm
really glad that the 32-bit CMUCL "just runs" on Athlon64!


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Juho Snellman
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <slrnd71ekb.beo.jsnell@sbz-31.cs.Helsinki.FI>
<····@rpw3.org> wrote:
> Juho Snellman  <······@iki.fi> wrote:
> +---------------
>| Other than that, the main reasons for using a 64-bit SBCL right now are:
>|   * You have an Athlon 64, your Linux distribution is pure 64-bit without
>|     a 32-bit compatibility environment, and you don't want to run a 32-bit
>|     chroot.
> +---------------
> 
> Say what? Which x86_64 Linux *doesn't* have a 32-bit compatibility
> environment installed by default?!?

Debian. Gentoo didn't have one either, but I hear they've fixed that
in their latest release.

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Casper H.S. Dik
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <4270c2a1$0$153$e4fe514c@news.xs4all.nl>
Juho Snellman <······@iki.fi> writes:

><····@rpw3.org> wrote:
>> Juho Snellman  <······@iki.fi> wrote:
>> +---------------
>>| Other than that, the main reasons for using a 64-bit SBCL right now are:
>>|   * You have an Athlon 64, your Linux distribution is pure 64-bit without
>>|     a 32-bit compatibility environment, and you don't want to run a 32-bit
>>|     chroot.
>> +---------------
>> 
>> Say what? Which x86_64 Linux *doesn't* have a 32-bit compatibility
>> environment installed by default?!?

>Debian. Gentoo didn't have one either, but I hear they've fixed that
>in their latest release.

At least Solaris never has that problem (it even gives you a choice of
a 32- or 64-bit kernel).

Casper
-- 
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
From: Casper H.S. Dik
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <4270b232$0$151$e4fe514c@news.xs4all.nl>
Ulrich Hobelmann <···········@web.de> writes:

>Is there no way to use the extra registers with 32bits only, like 
>on the PowerPC?

No.

Casper
From: Casper H.S. Dik
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <4270b38f$0$157$e4fe514c@news.xs4all.nl>
Juho Snellman <······@iki.fi> writes:

>For some values of "appropriate" :-) From the part you didn't quote:

>>> The extra registers of the x86-64 architecture don't seem to give
>>> much of a performance boost, except in some pathological cases.

Ah.

>Specifically, the code generator can use most of the extra registers 
>(a couple are reserved for internal uses). We still use the x86
>port's internal calling convention that passes only the first three
>arguments in registers. The x86-64 port inherited a fair bit of
>register-starvation induced braindamage from x86. It's possible that
>fixing these problems is required to get the full benefit from the
>extra registers.

The UNIX ABI for AMD64 passes quite a lot in registers (this makes debugging
not a lot of fun)

>As for everything being faster in 64-bit mode... The following C code
>(which approximates my above example of LENGTH on a long list) is
>twice as fast on my Athlon 64 2800+ when compiled with gcc -m32 than
>when compiled natively. Clearly the same realities of limited memory
>bandwidth apply to C as well. I don't know why this doesn't affect
>real-world C programs. Maybe differences in programming style? "int" 
>still being a 32-bit type in the 64-bit Unix ABI?

Same here; and this is probably a memory bandwidth limitation rather
than anything else (so perhaps it's an issue for Lisp at least for
CAR/CDR limited code)
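
[Editor's sketch, not the C code Juho posted, which is not quoted here.]
One plausible reading of the memory-bandwidth explanation: a list cell
holds two pointers, so it is typically 8 bytes when built with gcc -m32
and 16 bytes as a native 64-bit build, and walking the same list has to
pull twice as many bytes through the memory system. The struct and
function names below are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* A cons-like cell: one payload pointer (CAR) and one link (CDR). */
struct cell {
    void        *car;
    struct cell *cdr;
};

/* Walk the list and count its cells, roughly what LENGTH does. */
size_t list_length(const struct cell *list)
{
    size_t n = 0;
    for (; list != NULL; list = list->cdr)
        n++;
    return n;
}

/* Bytes of cell data that a traversal of n cells has to read. */
size_t bytes_touched(size_t n)
{
    /* Sanity check: the cell is exactly two pointers, with no padding. */
    assert(sizeof(struct cell) == 2 * sizeof(void *));
    return n * sizeof(struct cell);
}
```

The loop itself is identical in both builds; only sizeof(struct cell)
changes, and with it the value of bytes_touched() for the same list.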

Casper
From: Duane Rettig
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <4u0lqam1y.fsf@franz.com>
Casper H.S. Dik <··········@Sun.COM> writes:

> Juho Snellman <······@iki.fi> writes:
> 
> >For some values of "appropriate" :-) From the part you didn't quote:
> 
> >>> The extra registers of the x86-64 architecture don't seem to give
> >>> much of a performance boost, except in some pathological cases.
> 
> Ah.
> 
> >Specifically, the code generator can use most of the extra registers 
> >(a couple are reserved for internal uses). We still use the x86
> >port's internal calling convention that passes only the first three
> >arguments in registers. The x86-64 port inherited a fair bit of
> >register-starvation induced braindamage from x86. It's possible that
> >fixing these problems is required to get the full benefit from the
> >extra registers.
> 
> The UNIX ABI for AMD64 passes quite a lot in registers (this makes debugging
> not a lot of fun)

Huh?  Passing in more registers is always faster than passing in
fewer, and there is no reason why debugging need be any less fun
(at least, as "fun" as debugging can be :-):

CL-USER(1): (compile
              (defun foo (a b c d)
                (declare (optimize speed (safety 0)))
                (caar (list a b c d))))
FOO
NIL
NIL
CL-USER(2): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: A B C D

;; code start: #x1000933cf8:
   0: 48 83 ec 78    sub	rsp,$120
   4: 4c 89 74 24 08 movq	[rsp+8],r14
   9: 6a 04          pushb	$4
  11: 58             popq	rax
  12: 41 ff 57 0f    call	*[r15+15]    ; LIST
  16: 48 8b 7f ef    movq	rdi,[rdi-17]
  20: 48 8b 7f ef    movq	rdi,[rdi-17]
  24: f8             clc
  25: 48 8d 64 24 78 leaq	rsp,[rsp+120]
  30: 4c 8b 74 24 10 movq	r14,[rsp+16]
  35: c3             ret
CL-USER(3): (foo 10 20 30 40)
Error: Attempt to take the car of 10 which is not listp.
  [condition type: TYPE-ERROR]

Restart actions (select using :continue):
 0: Return to Top Level (an "abort" restart).
 1: Abort entirely from this (lisp) process.
[1] CL-USER(4): :zo :all t
Evaluation stack:

... 1 more newer frame ...

   (ERROR TYPE-ERROR :DATUM ...)
   (EXCL::.TYPE-ERROR 10 LIST ...)
   (EXCL::ERROR-FROM-CODE 1 10)
   (SYS::..CONTEXT-SAVING-RUNTIME-OPERATION)
 ->(FOO 10 20 ...)
   (SYS::..RUNTIME-OPERATION . :UNKNOWN-ARGS)
   [... EXCL::%EVAL ]
   (EVAL (FOO 10 20 ...))
   (TPL::READ-EVAL-PRINT-ONE-COMMAND NIL NIL)
   (EXCL::READ-EVAL-PRINT-LOOP :LEVEL 0)

... more older frames ...
[1] CL-USER(5): :cur
(FOO 10 20 30 40)
[1] CL-USER(6): 

I'm assuming by "not a lot of fun" you mean "less information".  Do
you see any such lack of information in the debug session above?

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Casper H.S. Dik
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <42715387$0$158$e4fe514c@news.xs4all.nl>
Duane Rettig <·····@franz.com> writes:

>Huh?  Passing in more registers is always faster than passing in
>fewer, and there is no reason why debugging need be any less fun
>(at least, as "fun" as debugging can be :-):


Tracing a 20-deep stack frame without being able to
see the arguments of calls and having to reconstruct them
is fairly daunting.

Casper
From: Duane Rettig
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <4pswea3kc.fsf@franz.com>
Casper H.S. Dik <··········@Sun.COM> writes:

> Duane Rettig <·····@franz.com> writes:
> 
> >Huh?  Passing in more registers is always faster than passing in
> >fewer, and there is no reason why debugging need be any less fun
> >(at least, as "fun" as debugging can be :-):
> 
> 
> Tracing a 20-deep stack frame without being able to
> see the arguments of calls and having to reconstruct them
> is fairly daunting.

Right, and that's why I'm asking you why such daunting lack of
information must be the case just because the arguments are
passed in registers.  If, as in RISC architectures, efforts
are made to shadow registers into standard locations, then
such information need never be lacking.  In fact, it might
be interesting to note that even in our x86 versions of
Allegro CL, our lisp calling convention passes and returns the
first two arguments in registers (without requiring a lack of
information under debugging circumstances).
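
[Editor's sketch, not Allegro CL's actual mechanism.] The idea of
shadowing register arguments into a standard location can be illustrated
in C: under the System V AMD64 ABI the first six integer arguments arrive
in registers (rdi, rsi, rdx, rcx, r8, r9), and a callee can copy them into
a fixed frame-local structure where a debugger could find them. The names
arg_home, sum_saved and foo_shadowed are hypothetical, and the volatile
qualifier is only there to keep the stores from being optimized away:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "home" area: a fixed layout that a debugger could read. */
struct arg_home {
    long a, b, c, d;
};

/* Stand-in for a backtrace tool reading the saved arguments from memory. */
long sum_saved(const volatile struct arg_home *home)
{
    assert(home != NULL);
    return home->a + home->b + home->c + home->d;
}

/* The arguments arrive in registers; the first thing the callee does is
   store them to the home area, so they also exist in the stack frame. */
long foo_shadowed(long a, long b, long c, long d)
{
    volatile struct arg_home home = { a, b, c, d };
    return sum_saved(&home);
}
```

The cost is four extra stores per call; the benefit is that every frame
on the stack carries a readable copy of its arguments.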

From: lin8080
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <42752487.CA80C361@freenet.de>
Ulrich Hobelmann wrote:

> Is there no way to use the extra registers with 32bits only, like
> on the PowerPC?

> I think more registers might always help (ok, you said they
> didn't; maybe that's an implementation issue), but I don't think
> I'll ever need 64bit integers or pointers.

Hello

Storing one bit in memory takes 6 transistors. Each transistor has a
small leakage voltage, which could change a "1" to a "0" ("0" is more
stable than "1").

The smaller the transistor becomes, the more likely this is. There are
ways to write more "0" values into memory, but this also takes CPU time.

To store 512 KB you need .... and a big cooling-fan. 

Example: my RAM is refreshed about 400 times/sec by the north-bridge, to
guarantee the correct values. This is CPU expensive.

For internal memory the MHz is pumped up to keep the refresh times as low
as possible. But there are many, many side effects and a slow
outside transfer (sit and wait).

So the best case: use fewer registers, so that their contents change more
often. Normally one works with only two (four, plus special) registers
per cycle.

New boards raise the bus speed. It seems that until the bus speed (and
that of all the components hanging off it) is high enough, the MHz value
is out of sight...

stefan

(iterative vs parallel)
From: Ulrich Hobelmann
Subject: Re: Lisp on AMD 64
Date: 
Message-ID: <3dpk8dF6klg4fU1@individual.net>
lin8080 wrote:
> Ulrich Hobelmann wrote:
> 
> 
>>Is there no way to use the extra registers with 32bits only, like
>>on the PowerPC?
> 
> 
>>I think more registers might always help (ok, you said they
>>didn't; maybe that's an implementation issue), but I don't think
>>I'll ever need 64bit integers or pointers.
> 
> 
> Hello
> 
> Storing one bit in memory takes 6 transistors. Each transistor has a
> small leakage voltage, which could change a "1" to a "0" ("0" is more
> stable than "1").
> 
> The smaller the transistor becomes, the more likely this is. There are
> ways to write more "0" values into memory, but this also takes CPU time.
> 
> To store 512 KB you need .... and a big cooling-fan. 

I don't quite understand you.  What are you saying and what's your 
point?

> Example: my RAM is refreshed about 400 times/sec by the north-bridge, to
> guarantee the correct values. This is CPU expensive.

The other way round: intensive CPU use tends to read/write stuff 
to/from memory.  The CPU usage is what you *want*; it's what you 
wrote your program for.

> For internal memory the MHz is pumped up to keep the refresh times as low
> as possible. But there are many, many side effects and a slow
> outside transfer (sit and wait).
> 
> So the best case: use fewer registers, so that their contents change more
> often. Normally one works with only two (four, plus special) registers
> per cycle.
> 
> New boards raise the bus speed. It seems that until the bus speed (and
> that of all the components hanging off it) is high enough, the MHz value
> is out of sight...

I have no idea what you're trying to say.  I think if you have 
more registers, you might need fewer memory transfers (like PPC vs 
x86).  Of course if you write PPC code that's just like x86 code 
(like gcc does), then you don't get an advantage, and your PPC 
code will run as fast as an x86 *at the same frequency*, that is: 
slow.  Use more registers and your code does much more in parallel.
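
[Editor's sketch, not from Ulrich's post.] One common way extra registers
turn into parallelism is to split a reduction across several independent
accumulators, each of which can stay in its own register. Whether this is
actually faster depends on the compiler and the CPU, and no timing is
claimed here; the function names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every addition depends on the previous one. */
unsigned long sum_one_accumulator(const unsigned *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: four independent dependency chains that can each
   live in a register, so the hardware is free to overlap the additions. */
unsigned long sum_four_accumulators(const unsigned *v, size_t n)
{
    assert(v != NULL || n == 0);
    unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum; the second simply needs more live
registers, which is where a register-starved architecture loses out.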
