SBCL performance on OS X

From: William Bland
Subject: SBCL performance on OS X
Date: Tue, 06 Dec 2005 03:49:00 +0000
Message-ID: <pan.2005.12.06.03.48.59.912205@gmail.com>

I've been playing around for a while with close-to-out-of-memory
situations in Lisp.  As part of this investigation, I wrote a very simple
program that just gobbles up memory until the Lisp process dies.

Running this program seems to show an incredible disparity between the
speed of SBCL on Linux, and on OS X.  On Linux, the program dies after
less than a second, having allocated around 520MB.  On my Powerbook the
same program takes 1 minute 13 seconds to allocate the same amount of
memory!

More details are at:
http://www.livejournal.com/users/abstractstuff/11286.html

I'd be very interested to hear any suggestions on what might be causing
this huge difference in performance.  I'll be sure to follow up with
anything else that I discover.

Thanks and best wishes,
	Bill.

From: Edi Weitz
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 03:58:46 +0000
Message-ID: <uy82yanbd.fsf@agharta.de>

On Tue, 06 Dec 2005 03:49:00 GMT, William Bland <···············@gmail.com> wrote:

> I've been playing around for a while with close-to-out-of-memory
> situations in Lisp.  As part of this investigation, I wrote a very
> simple program that just gobbles up memory until the Lisp process
> dies.
>
> Running this program seems to show an incredible disparity between
> the speed of SBCL on Linux, and on OS X.  On Linux, the program dies
> after less than a second, having allocated around 520MB.  On my
> Powerbook the same program takes 1 minute 13 seconds to allocate the
> same amount of memory!
>
> More details are at:
> http://www.livejournal.com/users/abstractstuff/11286.html
>
> I'd be very interested to hear any suggestions on what might be
> causing this huge difference in performance.

I'm tempted to say that Macs are just a bit slower... :)

But, seriously, on Linux you use SBCL 0.9.4, on OS X you use SBCL
0.9.6.  You should probably use the same version on both platforms
before you draw any further conclusions.

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 04:17:18 +0000
Message-ID: <pan.2005.12.06.04.17.13.845473@gmail.com>

On Tue, 06 Dec 2005 04:58:46 +0100, Edi Weitz wrote:

> On Tue, 06 Dec 2005 03:49:00 GMT, William Bland <···············@gmail.com> wrote:
> 
>> I'd be very interested to hear any suggestions on what might be
>> causing this huge difference in performance.
> 
> I'm tempted to say that Macs are just a bit slower... :)
> 
> But, seriously, on Linux you use SBCL 0.9.4, on OS X you use SBCL
> 0.9.6.  You should probably use the same version on both platforms
> before you draw any further conclusions.
> 
> Cheers,
> Edi.

That kind of a slowdown would seem like a fairly impressive regression! 
Point taken though - I'll test again with the same version on both...

Thanks and best wishes,
	Bill.

From: Luís Oliveira
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 04:57:50 +0000
Message-ID: <m264q2om9d.fsf@deadspam.com>

William Bland <···············@gmail.com> writes:
> That kind of a slowdown would seem like a fairly impressive regression! 
> Point taken though - I'll test again with the same version on both...

Try OpenMCL too.

-- 
Luís Oliveira
luismbo (@) gmail (.) com
Equipa Portuguesa do Translation Project
http://www.iro.umontreal.ca/translation/registry.cgi?team=pt

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 07:40:37 +0000
Message-ID: <pan.2005.12.06.07.40.37.159326@gmail.com>

On Tue, 06 Dec 2005 04:57:50 +0000, Luís Oliveira wrote:

> William Bland <···············@gmail.com> writes:
>> That kind of a slowdown would seem like a fairly impressive regression! 
>> Point taken though - I'll test again with the same version on both...
> 
> Try OpenMCL too.

I haven't got the SBCL versions synced yet, but I did install OpenMCL and
added simple timing code to the memory gobbler.

OpenMCL comes in around an order of magnitude faster than SBCL, but still
an order of magnitude slower than SBCL on the server.
Details:  http://www.livejournal.com/users/abstractstuff/11757.html

I'm going to sync the SBCL versions, and also try the equivalent C program
on both the server and the powerbook, to see how much of this difference
is coming from the Lisp implementations and how much is from the OS X
kernel and hardware.

Best wishes,
	Bill.

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 08:58:22 +0000
Message-ID: <pan.2005.12.06.08.58.21.844399@gmail.com>

On Tue, 06 Dec 2005 07:40:37 +0000, William Bland wrote:

> I'm going to sync the SBCL versions, and also try the equivalent C program
> on both the server and the powerbook, to see how much of this difference
> is coming from the Lisp implementations and how much is from the OS X
> kernel and hardware.

I just got the equivalent C program running.  This time the difference
between the powerbook and the server is nothing like an order of magnitude.
Details:  http://www.livejournal.com/users/abstractstuff/11805.html

This would seem to suggest that neither the OS X kernel nor the powerbook
hardware are to blame, and that OS X Lisp implementations may have some
room for improvement (of course the program I'm running is very
artificial, so do take this with a pinch of salt).

Best wishes,
	Bill.
-- 
William Bland.      http://www.abstractnonsense.com/

From: Ulrich Hobelmann
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 06:31:30 +0000
Message-ID: <3vkpi2F16ij8fU1@individual.net>

William Bland wrote:
> I've been playing around for a while with close-to-out-of-memory
> situations in Lisp.  As part of this investigation, I wrote a very simple
> program that just gobbles up memory until the Lisp process dies.
> 
> Running this program seems to show an incredible disparity between the
> speed of SBCL on Linux, and on OS X.  On Linux, the program dies after
> less than a second, having allocated around 520MB.  On my Powerbook the
> same program takes 1 minute 13 seconds to allocate the same amount of
> memory!

Hm, the Mac kernel (at least on 10.3 when I tried) seems to be *very* 
slow.  I noticed that inter-thread communication (conditions) takes 
orders of magnitude more time than for instance on FreeBSD.

Now I'd expect simple memory block allocation to be waaay faster and 
easier to do than threads, but maybe the Mac has problems in its memory 
page manager?

Have you run Activity Monitor?  Is it mostly user (green) or mostly 
system time (red)?

-- 
Majority, n.: That quality that distinguishes a crime from a law.

From: Christophe Rhodes
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 09:00:23 +0000
Message-ID: <sqmzjeob14.fsf@cam.ac.uk>

William Bland <···············@gmail.com> writes:

> I've been playing around for a while with close-to-out-of-memory
> situations in Lisp.  As part of this investigation, I wrote a very simple
> program that just gobbles up memory until the Lisp process dies.
>
> Running this program seems to show an incredible disparity between the
> speed of SBCL on Linux, and on OS X.  

No, it doesn't.  Judging by your weblog entry, it shows a disparity
between the speed of SBCL on an x86 and on a powerpc.  That is, I
would expect SBCL on OS X/x86 (when such a beast appears) to perform
similarly to SBCL on any other OS on the x86, while I would expect
SBCL on Linux/ppc to perform similarly to SBCL on OS X/ppc.

> On Linux, the program dies after less than a second, having
> allocated around 520MB.  On my Powerbook the same program takes 1
> minute 13 seconds to allocate the same amount of memory!

The garbage collection implementations on x86-family machines and
non-x86es for SBCL are completely different.  The non-x86 machines use
a simple Cheney semi-space collector, which is relatively efficient
when almost all data is garbage and very inefficient when almost all
data is live: specifically, garbage collection time is proportional to
the amount of live data.  x86es use a mostly-copying (conservative)
generational collector, supported by a write barrier, which
empirically has much more even performance over a range of datasets,
and rarely needs to do a full collection; for semispace collectors,
every collection is a full collection.  If you want to allocate memory
more quickly, try setting (bytes-consed-between-gcs) to something
larger than its default value: say 100Mb.
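
A toy sketch of the semi-space idea may help here.  It is not SBCL's
collector: the cell layout, the fixed SEMI_CELLS capacity and the names
cell_alloc and cheney_collect are all invented for illustration.  It
only shows that a Cheney-style collection copies (and so pays for)
live cells alone, and that it needs a second space of equal size to
copy them into.

```c
#include <assert.h>
#include <stddef.h>

#define SEMI_CELLS 1024            /* capacity of each semi-space (assumed) */

typedef struct cell {
    struct cell *car;              /* child pointer or NULL */
    struct cell *cdr;              /* child pointer or NULL */
    struct cell *forward;          /* new address once the cell is copied */
} cell;

static cell space_a[SEMI_CELLS], space_b[SEMI_CELLS];
static cell *from_space = space_a, *to_space = space_b;
static size_t alloc_top = 0;       /* next free cell in from_space */

/* Allocate by bumping an index; returns NULL when the space is full. */
cell *cell_alloc(cell *car, cell *cdr)
{
    if (alloc_top == SEMI_CELLS)
        return NULL;
    cell *c = &from_space[alloc_top++];
    c->car = car;
    c->cdr = cdr;
    c->forward = NULL;
    return c;
}

/* Copy one cell into to_space unless that has already been done,
   and return its new address. */
static cell *copy_cell(cell *c, size_t *free_top)
{
    if (c == NULL)
        return NULL;
    if (c->forward == NULL) {
        assert(*free_top < SEMI_CELLS);
        cell *n = &to_space[(*free_top)++];
        *n = *c;                   /* n->forward is NULL at this point */
        c->forward = n;
    }
    return c->forward;
}

/* Copy everything reachable from the roots, then swap the two spaces.
   Returns the number of cells copied, i.e. the amount of live data. */
size_t cheney_collect(cell **roots, size_t nroots)
{
    size_t scan = 0, free_top = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = copy_cell(roots[i], &free_top);
    while (scan < free_top) {      /* to_space itself is the work queue */
        cell *c = &to_space[scan++];
        c->car = copy_cell(c->car, &free_top);
        c->cdr = copy_cell(c->cdr, &free_top);
    }
    cell *tmp = from_space;
    from_space = to_space;
    to_space = tmp;
    alloc_top = free_top;
    return free_top;
}
```

With 100 unreachable cells and a 10-cell live list, cheney_collect
touches only the 10 live cells; with everything live it would copy
everything, on every collection.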

As for why: there is no fundamental reason, but the longer history of
a development-friendly x86 platform than that of a powerpc platform
seems to have encouraged garbage collection optimizations for the x86
from contributors, who for the most part (but not exclusively) work
independently of commercial incentives.  It would take a small amount
of sustained effort to port the garbage collector and allocator to a
non-x86 platform.

Christophe

From: André Thieme
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 09:27:05 +0000
Message-ID: <1133861225.722528.236650@z14g2000cwz.googlegroups.com>

When doing what Christophe suggested
(setf (bytes-consed-between-gcs) (* 100 1024 1024))
the version of William takes ca. 0.14 seconds until the memory limit is
reached.
The C version is faster, but not by orders of magnitude: it runs in ca.
0.13 seconds.
(P4 2800, 1 GB, Gentoo, cmucl / gcc)


André
--

From: André Thieme
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 09:29:37 +0000
Message-ID: <1133861377.125715.325520@g43g2000cwa.googlegroups.com>

André Thieme schrieb:

> the version of William takes ca. 0.14 seconds until the memory limit is [...]

the Lisp version of William...

From: Gareth McCaughan
Subject: Re: SBCL performance on OS X
Date: Wed, 07 Dec 2005 00:31:22 +0000
Message-ID: <87pso9daex.fsf@g.mccaughan.ntlworld.com>

André Thieme wrote:

> André Thieme schrieb:
> 
>> the version of William takes ca. 0.14 seconds until the memory limit is [...]
> 
> the Lisp version of William...

What other languages has William been implemented in?

-- 
Gareth McCaughan
.sig under construc

From: Thomas F. Burdick
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 09:49:56 +0000
Message-ID: <xcvk6eipnaz.fsf@conquest.OCF.Berkeley.EDU>

Christophe Rhodes <·····@cam.ac.uk> writes:

> William Bland <···············@gmail.com> writes:
>
> > On Linux, the program dies after less than a second, having
> > allocated around 520MB.  On my Powerbook the same program takes 1
> > minute 13 seconds to allocate the same amount of memory!
> 
> The garbage collection implementations on x86-family machines and
> non-x86es for SBCL are completely different.  The non-x86 machines use
> a simple Cheney semi-space collector, which is relatively efficient
> when almost all data is garbage and very inefficient when almost all
> data is live: specifically, garbage collection time is proportional to
> the amount of live data.  x86es use a mostly-copying (conservative)
> generational collector, supported by a write barrier, which
> empirically has much more even performance over a range of datasets,
> and rarely needs to do a full collection; for semispace collectors,
> every collection is a full collection.  If you want to allocate memory
> more quickly, try setting (bytes-consed-between-gcs) to something
> larger than its default value: say 100Mb.
> 
> As for why: there is no fundamental reason, but the longer history of
> a development-friendly x86 platform than that of a powerpc platform
> seems to have encouraged garbage collection optimizations for the x86
> from contributors, who for the most part (but not exclusively) work
> independently of commercial incentives.  It would take a small amount
> of sustained effort to port the garbage collector and allocator to a
> non-x86 platform.

And in order to see why no one has thought this was vital enough to
their work to actually port the generational gc to ppc, the OP might
be interested in trying other mini-benchmarks: tuning the GC with
different parameters, and also trying differing mixes of live data and
garbage in what's allocated.  In real programs, where there does tend
to be a lot of garbage, the two-space gc isn't optimal, but it's not
terribly bad, either.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | Free Mumia Abu-Jamal! |
     ,--'    _,'   | Abolish the racist    |
    /       /      | death penalty!        |
   (   -.  |       `-----------------------'
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Christophe Rhodes
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 10:02:15 +0000
Message-ID: <sqiru2o860.fsf@cam.ac.uk>

Christophe Rhodes <·····@cam.ac.uk> writes:

> William Bland <···············@gmail.com> writes:
>
>> On Linux, the program dies after less than a second, having
>> allocated around 520MB.  On my Powerbook the same program takes 1
>> minute 13 seconds to allocate the same amount of memory!
>
> The garbage collection implementations on x86-family machines and
> non-x86es for SBCL are completely different.  The non-x86 machines use
> a simple Cheney semi-space collector [...]

One thing I failed to mention, but which is relevant for your experiment,
is that the "semi-space"ness is important: if there is 512Mb of live
data, the working set of the GC is 1024Mb: because that live data has
to be copied to the new semi-space.  Thus, you will hit swap earlier
than you might expect.

Other things from your blog that I'll respond to:

> I had been led to believe that modern GCs could actually be faster
> than C at allocating memory.

I don't think anyone's ever claimed that.  However, modern GCs can
actually be faster than C at memory management: that is, you need not
just malloc() but also free(), or its equivalent.  There are also
usually weasel-words such as "for real-world tasks" associated,
because for your experimental application the obvious winning strategy
is a malloc() which bumps a heap free-pointer and a free() which does
nothing: I guarantee that this will beat any implementation you can
come up with, and also that this implementation strategy would be
wildly unpopular with every other user :-).
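
A minimal sketch of that "winning strategy", for illustration only: the
names bump_malloc and bump_free, the fixed 1MB arena and the 8-byte
rounding are assumptions made up for this example, not any real
library's allocator.

```c
#include <assert.h>
#include <stddef.h>

#define BUMP_HEAP_SIZE (1024 * 1024)   /* arena size, chosen arbitrarily */
#define BUMP_ALIGN 8                   /* every request is rounded up to this */

/* The union members only force a sensible alignment on the byte array. */
static union {
    unsigned char bytes[BUMP_HEAP_SIZE];
    long long align_ll;
    long double align_ld;
    void *align_ptr;
} bump_heap;

static size_t bump_next = 0;           /* offset of the first free byte */

/* Allocation is one bounds check and one addition. */
void *bump_malloc(size_t n)
{
    if (n > BUMP_HEAP_SIZE)
        return NULL;
    size_t rounded = (n + BUMP_ALIGN - 1) & ~(size_t)(BUMP_ALIGN - 1);
    if (rounded > BUMP_HEAP_SIZE - bump_next)
        return NULL;
    void *p = &bump_heap.bytes[bump_next];
    bump_next += rounded;
    assert(bump_next <= BUMP_HEAP_SIZE);
    return p;
}

/* Freeing does nothing at all: memory is never reused. */
void bump_free(void *p)
{
    (void)p;
}
```

Nothing is ever reclaimed, so this is unbeatable for a program that
only allocates until it dies and useless for a long-running one.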

> OpenMCL on the powerbook is an order of magnitude slower than SBCL
> on the server

OpenMCL, similar to SBCL on x86s, has a generational (I think they
call it "ephemeral", but it means the same) garbage collector, so I
would expect it to perform better than SBCL does on powerpcs; the
difference you see between OpenMCL on the powerpc and SBCL on the x86
is more likely to be representative of the true difference in memory
handling between the hardware/kernel platforms than straight
comparisons between SBCLs.

>        if(malloc(10*1024*1024) == NULL)
>        {
>            ok = 0;
>        }

This body of your "equivalent" C program doesn't lend itself to
comparison with the lisp implementation of your experiment terribly
well.  In likely ascending order of seriousness:

* the C compiler will constant-fold away the multiplication here,
  whereas you have a variable allocation in the lisp;

* your timing code does exact integer division in lisp and approximate
  floating point division in C; the exact integer division is likely
  to create garbage;

* you don't do anything with the returned pointer, whereas in lisp you
  allocate an extra two-word structure to store it in, and mutate the
  heap;

* you are returning uninitialized memory here, whereas I would guess
  that the lisp make-array returns initialized memory; to measure the
  same amount of work, I would suggest using calloc() instead.

Which of these actually matters I don't know; I'd be surprised if the
first had any measurable effect; the fourth is likely to be quite
serious, as it involves touching 10Mb of memory per iteration; the
third is more interesting, because it potentially leads to heap
fragmentation.
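
A sketch of a C loop that addresses the first, third and fourth points
might look like the following.  The names gobble, release and struct
node are hypothetical, and the loop is bounded by max_blocks so that it
can be tried without actually exhausting memory; the original
experiment would simply loop until allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Two-word node standing in for the cons cell that holds each block. */
struct node {
    void *block;
    struct node *next;
};

/* Allocate up to max_blocks zero-filled blocks of block_size bytes
   (a run-time value, so nothing is constant-folded) and push each one
   onto the list at *head.  Returns the number of blocks obtained. */
size_t gobble(size_t block_size, size_t max_blocks, struct node **head)
{
    assert(head != NULL);
    size_t n = 0;
    while (n < max_blocks) {
        void *block = calloc(1, block_size);      /* initialized memory */
        struct node *cell = malloc(sizeof *cell);
        if (block == NULL || cell == NULL) {
            free(block);                          /* free(NULL) is a no-op */
            free(cell);
            break;
        }
        cell->block = block;                      /* keep the pointer live */
        cell->next = *head;
        *head = cell;
        n++;
    }
    return n;
}

/* Give everything back, walking the list built by gobble. */
void release(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head->block);
        free(head);
        head = next;
    }
}
```

Whether calloc really touches the memory it hands out depends on the
C library and kernel, so it may still do less work than make-array.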

> if I can't rapidly make Lisp processes die with out-of-memory
> errors

Possibly the most straightforward way of achieving this is to shrink
the available heap size.  This is easy to do for SBCL: simply change
the parameters in src/compiler/ppc/parms.lisp and recompile.

> If anything, the powerbook has the edge over the server (in terms of
> raw MHz/GB numbers - yes, I know these are completely different chip
> architectures though).

The difference in the CPU chip architecture is probably not relevant;
much more important for this experiment, since almost no computation
is being done, are memory bandwidth and latency.

Christophe

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 18:20:35 +0000
Message-ID: <pan.2005.12.06.18.20.34.245264@gmail.com>

On Tue, 06 Dec 2005 10:02:15 +0000, Christophe Rhodes wrote:
> 
> Possibly the most straightforward way of achieving this is to shrink
> the available heap size.  This is easy to do for SBCL: simply change
> the parameters in src/compiler/ppc/parms.lisp and recompile.
> 

Thanks very much for taking the time to reply with all this useful
information Christophe!  I had been assuming that SBCL used a single GC
over all platforms.  Things begin to make a lot more sense now.

Your comment above made me wonder if there's any fundamental reason nobody
has added a maximum-heap-size command-line switch to SBCL.  Perhaps it's
just lack of time and/or interest?  If that's the case I might have a go
at doing it myself.

Thanks again and best wishes,
	Bill.

From: Christophe Rhodes
Subject: Re: SBCL performance on OS X
Date: Tue, 06 Dec 2005 18:35:29 +0000
Message-ID: <sq4q5mul8u.fsf@cam.ac.uk>

William Bland <···············@gmail.com> writes:

> Thanks very much for taking the time to reply with all this useful
> information Christophe!  I had been assuming that SBCL used a single GC
> over all platforms.  Things begin to make a lot more sense now.

You're welcome.

Something you might be interested to see, as well as the calloc()
variant I suggested, is a C program whose allocation loop uses
malloc() to allocate the memory and then memset() to set it to some
value (say, 0): something like
  void *foo;
  if ((foo = malloc(10*1024*1024)) == NULL) {
    ok = 0;
  } else {
    memset(foo,0,10*1024*1024);
  }
I was mildly surprised at the performance characteristics that had on
my laptop when I tried it earlier.

Christophe

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Sat, 10 Dec 2005 16:06:11 +0000
Message-ID: <pan.2005.12.10.16.06.10.936856@gmail.com>

On Tue, 06 Dec 2005 18:35:29 +0000, Christophe Rhodes wrote:
> 
> Something you might be interested to see, as well as the calloc()
> variant I suggested, is a C program whose allocation loop uses
> malloc() to allocate the memory and then memset() to set it to some
> value (say, 0): something like
>   void *foo;
>   if ((foo = malloc(10*1024*1024)) == NULL) {
>     ok = 0;
>   } else {
>     memset(foo,0,10*1024*1024);
>   }
> I was mildly surprised at the performance characteristics that had on
> my laptop when I tried it earlier.
> 

Thanks Christophe, I finally got a chance to spend some time on this, and
was stunned by the difference.

I wrote a little about it at
http://www.livejournal.com/users/abstractstuff/12411.html

Thanks again and best wishes,
	Bill.
-- 
William Bland.      http://www.abstractnonsense.com/

From: Bradley J Lucier
Subject: Re: SBCL performance on OS X
Date: Sat, 10 Dec 2005 17:18:48 +0000
Message-ID: <dnf2lo$nd8@arthur.cs.purdue.edu>

In article <······························@gmail.com>,
William Bland  <···············@gmail.com> wrote:
>Thanks Christophe, I finally got a chance to spend some time on this, and
>was stunned by the difference.
>
>I wrote a little about it at
>http://www.livejournal.com/users/abstractstuff/12411.html


From 

http://developer.apple.com/documentation/Performance/Conceptual/ManagingMemory/index.html

we have

> When you call memset right after malloc, the virtual memory system
> must map the corresponding pages into memory in order to zero-initialize
> them. This operation can be very expensive and wasteful, especially
> if you do not use the pages right away.
> 
> The calloc routine reserves the required virtual address space
> for the memory but waits until the memory is actually used before
> initializing it. This approach alleviates the need to map the
> pages into memory right away. It also lets the system initialize
> pages as they're used, as opposed to all at once.

It appears that calloc doesn't do anything except set up some information in
the virtual memory map.
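
One rough way to look at this is to time the two ways of getting a
zeroed block.  The helper below is only a sketch: time_zeroed_alloc is
a made-up name, clock() measures CPU time rather than wall-clock time,
and the numbers will depend heavily on the C library, the kernel and
the block size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Obtain `size` zeroed bytes, either with malloc followed by memset
   (use_memset != 0) or with calloc, and return the CPU seconds spent.
   The block is handed back through *out; the caller must free it. */
double time_zeroed_alloc(size_t size, int use_memset, void **out)
{
    assert(out != NULL);
    void *p;
    clock_t start = clock();
    if (use_memset) {
        p = malloc(size);
        if (p != NULL)
            memset(p, 0, size);    /* writes to every page right away */
    } else {
        p = calloc(1, size);       /* may leave pages untouched until used */
    }
    clock_t end = clock();
    *out = p;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

An optimizing compiler may turn malloc followed by memset(.., 0, ..)
into a single calloc call, so it is safest to build this without
optimization when comparing the two paths.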

Brad

From: William Bland
Subject: Re: SBCL performance on OS X
Date: Sat, 10 Dec 2005 20:36:20 +0000
Message-ID: <pan.2005.12.10.20.36.19.590269@gmail.com>

On Sat, 10 Dec 2005 12:18:48 -0500, Bradley J Lucier wrote:

> In article <······························@gmail.com>,
> William Bland  <···············@gmail.com> wrote:
>>http://www.livejournal.com/users/abstractstuff/12411.html
> 
[snip]
> 
> It appears that calloc doesn't do anything except set up some information in
> the virtual memory map.
> 
> Brad

Hey, thanks Brad!  Nice to see I wasn't barking up the wrong tree.

Best wishes,
	Bill.
-- 
William Bland.      http://www.abstractnonsense.com/