From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <7k0yqex0.fsf@ccs.neu.edu>
Cliff Crawford <·····@cornell.edu> writes:

> I'm trying to write a simple neural network simulation in Lisp, but
> I'm having trouble with number consing in the inner loop in the
> training function.  According to SBCL's profiler, the function
> feed-forward (see below) conses 32 bytes every time it's called, but
> I'd like it to not cons at all if possible.  

Why?  Is GC an issue here?

From: Cliff Crawford
Subject: Re: reducing number consing
Date: 
Message-ID: <kg3Eb.36917$Ug6.16010@twister.nyroc.rr.com>
On 2003-12-15, Joe Marshall <···@ccs.neu.edu> wrote:
|  Cliff Crawford <·····@cornell.edu> writes:
| 
| > I'm trying to write a simple neural network simulation in Lisp, but
| > I'm having trouble with number consing in the inner loop in the
| > training function.  According to SBCL's profiler, the function
| > feed-forward (see below) conses 32 bytes every time it's called, but
| > I'd like it to not cons at all if possible.  
| 
|  Why?  Is GC an issue here?

I'm worried it will be, because in typical usage it will get called
100,000 - 1,000,000 times.  It's certainly fast enough for my test
cases, though.  Anyway, I'll go ask on the SBCL mailing list (I didn't
realize this would be an implementation-specific thing).  Thanks for
all the responses.


-- 
 Cliff Crawford                ***                  ·····@cornell.edu

"Platypus? I thought it was pronounced platymapus. Has it always been
 pronounced platypus?"                             -- Jessica Simpson
From: Raymond Toy
Subject: Re: reducing number consing
Date: 
Message-ID: <4n4qvzdrm8.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Cliff" == Cliff Crawford <·····@cornell.edu> writes:

    Cliff> On 2003-12-15, Joe Marshall <···@ccs.neu.edu> wrote:
    Cliff> |  Cliff Crawford <·····@cornell.edu> writes:
    Cliff> | 
    Cliff> | > I'm trying to write a simple neural network simulation in Lisp, but
    Cliff> | > I'm having trouble with number consing in the inner loop in the
    Cliff> | > training function.  According to SBCL's profiler, the function
    Cliff> | > feed-forward (see below) conses 32 bytes every time it's called, but
    Cliff> | > I'd like it to not cons at all if possible.  
    Cliff> | 
    Cliff> |  Why?  Is GC an issue here?

    Cliff> I'm worried it will be, because in typical usage it will get called
    Cliff> 100,000 - 1,000,000 times.  It's certainly fast enough for my test
    Cliff> cases, though.  Anyway, I'll go ask on the SBCL mailing list (I didn't
    Cliff> realize this would be an implementation-specific thing).  Thanks for
    Cliff> all the responses.

Shouldn't you profile it with real stuff instead of test code before
micro-optimizing this? 

Luke Gorrie was saying that in one of his ethernet switch (?) 
programs, he was consing for every single ethernet packet.  It turned
out that it didn't really matter, and generational GC took care of the
consing quite well.

Ray
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lhad5rukjx.fsf@dodo.bluetail.com>
Raymond Toy <···@rtp.ericsson.se> writes:

> Luke Gorrie was saying that in one of his ethernet switch (?) 
> programs, he was consing for every single ethernet packet.

No need to rely on hearsay from me. TIME will say how much time was
spent in GC anyway (if you can tear your eyes away from "bytes consed" :-)

~20MB/sec of consing cost me <10% CPU, which was just fine with me.
YMMV.
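
For example, along these lines (GARBAGE is just a stand-in for the
real workload; TIME's report varies by implementation, but CMUCL's,
for one, prints the GC run time when a collection happens inside the
timed form):

  (defun garbage () (cons 1 2))
  (compile 'garbage)
  (time (dotimes (i (expt 10 6)) (garbage)))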

Cheers,
Luke
From: Rahul Jain
Subject: Re: reducing number consing
Date: 
Message-ID: <87d6ahq25j.fsf@nyct.net>
Jan Rychter <···@rychter.com> writes:

> I've recently been wondering how much of a real problem it is that most
> (all?) GCs tend to move data around in memory. This wasn't really a
> problem back in the days where memory was running at approximately the same
> speed as the CPU -- but these days (on PC architectures at least) our
> memories are still below 400MHz, while our CPUs go way up into GHz. A
> single memory access (with a cache miss) can easily cost you hundreds of
> CPU instructions. Cache suddenly becomes extremely valuable and moving
> data in memory generally ruins that.

Not all GCs move the data around, but for those that do, the fact that
they actively *compact* the memory actually makes them more
cache-efficient. Also, they often move the data by following
pointers. This means that following a chain of references is likely to
involve accessing data that is spatially near.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: reducing number consing
Date: 
Message-ID: <pan.2003.12.22.15.12.50.546725@knm.org.pl>
On Sun, 21 Dec 2003 16:41:44 -0500, Rahul Jain wrote:

> Also, they often move the data by following pointers. This means that
> following a chain of references is likely to involve accessing data that
> is spatially near.

The Cheney algorithm performs a breadth-first traversal of live objects,
so it actually puts referenced data far from the references. Is there
a reasonably simple modification which does this better?

Depth-first would be good but it would require maintaining a separate
stack of objects to be processed. Since object chains can be long, it
can't even be the native stack. But maybe this would be the answer?

I've just read <http://research.microsoft.com/~trishulc/papers/ismm_paper.pdf>
but realtime profiling is a bit too intrusive and complicated for my taste.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <ekuwq19o.fsf@comcast.net>
Marcin 'Qrczak' Kowalczyk <······@knm.org.pl> writes:

> On Sun, 21 Dec 2003 16:41:44 -0500, Rahul Jain wrote:
>
>> Also, they often move the data by following pointers. This means that
>> following a chain of references is likely to involve accessing data that
>> is spatially near.
>
> The Cheney algorithm performs a breadth-first traversal of live objects,
> so it actually puts referenced data far from the references. Is there
> a reasonably simple modification which does this better?

Use a stack.

> Depth-first would be good but it would require maintaining a separate
> stack of objects to be processed.  Since object chains can be long, it
> can't even be the native stack.  But maybe this would be the answer?

There are a couple of ways to deal with this.  One is to use a bounded
stack and to fall back on breadth-first when the stack overflows.  But
these days, it's pretty easy to just make the stack big enough.
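
As a toy model (not a real collector: nothing forwards the old
pointers, and the "tospace" is just the order in which copies are
made), the depth-first version looks something like this:

  ;; Depth-first traversal of the reachable conses with an explicit
  ;; stack.  The cdr is pushed last, hence visited first, so the
  ;; cells of a list spine come out adjacent in copy order.
  (defun copy-depth-first (root)
    (let ((seen (make-hash-table :test #'eq))   ; old cell -> its copy
          (stack (list root))
          (tospace '()))
      (loop until (null stack) do
        (let ((obj (pop stack)))
          (when (and (consp obj) (not (gethash obj seen)))
            (let ((copy (cons (car obj) (cdr obj))))
              (setf (gethash obj seen) copy)
              (push copy tospace)
              (push (car obj) stack)
              (push (cdr obj) stack)))))
      (nreverse tospace)))

A bounded version would check the stack depth before pushing and fall
back to Cheney-style breadth-first scanning on overflow.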


-- 
~jrm
From: Richard Fateman
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <SdqFb.1252$XZ7.960@newssvr29.news.prodigy.com>
Jan Rychter wrote:

> Mind you, this is just speculation, I don't have measurements. I was
> just wondering whether on today's architectures it isn't better to burn
> more CPU cycles in a GC that tries very hard to touch as little data in
> memory as possible.
> 
> I'm curious if any of the implementors present here have anything to say
> about this?
> 

Check out
http://www.cs.berkeley.edu/~fateman/papers/cachelisp.pdf

Yes, you can and probably should do things to reduce cache misses.
The positive side is that recent chips tend to have rather large
caches (L2 cache, anyway), and many programs will mostly fit there.
Of course you might find that much of it is thrashed periodically
by your gluttonous operating system, but that's life.  Unfortunately
the L1 cache is generally too small to store enough to matter.

Also there are absolutely great tools (PAPI) for trying things out.
I used Allegro CL but other FFI-capable Lisps should be able to run
the cache statistics registers on Pentia, Sparc, etc.

RJF
From: Raymond Toy
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <4nsmjc7vss.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Richard" == Richard Fateman <········@sbcglobal.net> writes:

    Richard> Also there are absolutely great tools (PAPI) for trying things out.
    Richard> I used Allegro CL but other FFI-capable Lisps should be able to run
    Richard> the cache statistics registers on Pentia, Sparc, etc.

Eric Marsden's CL benchmark suite has a set of routines for accessing
these registers for CMUCL.  The last time I ran it with his benchmark
routines, I don't think cache misses were much of a problem for this
benchmark.  FWIW.  YMMV.

Ray
From: Richard Fateman
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <6OGFb.1603$X11.918@newssvr29.news.prodigy.com>
I had not seen Eric Marsden's cache counting, but looking at it,
it seems to be specific to CMU-CL and Sparc v9.  PAPI is available
for users on linux or windows on Intel, as well as Sparc, Alpha,
PowerPC. Presumably there is no inherent dependence on CMU-CL in
the Sun performance monitoring routines!


In my experiments you have to exceed the L2 cache size to see things
slow down. You generally can't fit enough in the L1 cache to see a
slowdown between L1 and L2 sizes.  My guess is that for a lot of
small to medium-large lisp calculations the built-in cache behavior
is quite acceptable, which probably includes most benchmarks that
predate the era of gigabyte RAM.  Of course there are a couple more
cliffs to fall off:  exceeding L2 cache, exceeding L3 cache if there
is such a thing (maybe out at 2.5-3 megabytes?), and exceeding RAM, when
you fall into the hole of paging to disk.

RJF


Raymond Toy wrote:
>>>>>>"Richard" == Richard Fateman <········@sbcglobal.net> writes:
> 
> 
>     Richard> Also there are absolutely great tools (PAPI) for trying things out.
>     Richard> I used Allegro CL but other FFI-capable Lisps should be able to run
>     Richard> the cache statistics registers on Pentia, Sparc, etc.
> 
> Eric Marsden's CL benchmark suite has a set of routines for accessing
> these registers for CMUCL.  The last time I ran it with his benchmark
> routines, I don't think cache misses were much of a problem for this
> benchmark.  FWIW.  YMMV.
> 
> Ray
From: Rob Warnock
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <M5udnf25p89IEXeiRVn-uQ@speakeasy.net>
Richard Fateman  <········@sbcglobal.net> wrote:
+---------------
| In my experiments you have to exceed the L2 cache size to see things
| slow down. 
+---------------

Ungar's original paper[1] proposing a fixed-location nursery for
generational GCs made exactly this point: For best performance
the size of the nursery should be somewhere around the size of
the secondary cache.


-Rob

[1] I don't have the reference at hand at the moment, but could dig
    it out of my hard-copy file of GC papers when I get back home
    next week.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Raymond Toy
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <4n7k0fzmi1.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Richard" == Richard Fateman <········@sbcglobal.net> writes:

    Richard> I had not seen Eric Marsden's cache counting, but looking at it,
    Richard> it seems to be specific to CMU-CL and Sparc v9.  PAPI is available
    Richard> for users on linux or windows on Intel, as well as Sparc, Alpha,
    Richard> PowerPC. Presumably there is no inherent dependence on CMU-CL in
    Richard> the Sun performance monitoring routines!

No, there's not.  It just needs an FFI to the Sun routines, and I
guess Eric only made them work with CMUCL.  But Sparc v9 is a
requirement, I think, because that's when the performance monitoring
registers were made available.

    Richard> In my experiments you have to exceed the L2 cache size to see things
    Richard> slow down. You generally can't fit enough in the L1 cache to see a
    Richard> slowdown between L1 and L2 sizes.  My guess is that for a lot of
    Richard> small to medium-large lisp calculations the built-in cache behavior
    Richard> is quite acceptable, which probably includes most benchmarks that
    Richard> predate the era of gigabyte RAM.  Of course there are a couple more

I think the benchmarks themselves are fairly small in code space.
Some of the benchmarks do, however, cons quite a bit.  Say a few
hundred megabytes for some tests.

Ray
From: Eric Marsden
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <wzipte75qrj.fsf@melbourne.laas.fr>
>>>>> "rt" == Raymond Toy <···@rtp.ericsson.se> writes:
>>>>> "Richard" == Richard Fateman <········@sbcglobal.net> writes:

  Richard> I had not seen Eric Marsden's cache counting, but looking at it,
  Richard> it seems to be specific to CMU-CL and Sparc v9.  PAPI is available
  Richard> for users on linux or windows on Intel, as well as Sparc, Alpha,
  Richard> PowerPC. Presumably there is no inherent dependence on CMU-CL in
  Richard> the Sun performance monitoring routines!

  rt> No, there's not.  It just needs an FFI to the Sun routines, and I
  rt> guess Eric only made them work with CMUCL.  But Sparc v9 is a
  rt> requirement, I think, because that's when the performance monitoring
  rt> registers were made available.

the Solaris CPU Performance Counters API also provides access to the
performance counters when running on x86 platforms. However, the types
of events that can be monitored differ from UltraSPARC, and I don't
have any Solaris/x86 machines, so haven't investigated this.

Accessing the x86 performance counters from Linux requires a kernel
patch that I haven't yet installed. I expect that comparing the
CMUCL/x86 and CMUCL/UltraSPARC results would be interesting, since
recent x86 processors do out-of-order execution, which could reduce
the impact of the relatively poor instruction scheduling of the code
generated by CMUCL.

-- 
Eric Marsden                          <URL:http://www.laas.fr/~emarsden/>
From: Eric Marsden
Subject: Re: reducing number consing / cache experimental results
Date: 
Message-ID: <wzioetr5qq3.fsf@melbourne.laas.fr>
>>>>> "rf" == Richard Fateman <········@sbcglobal.net> writes:

  rf> In my experiments you have to exceed the L2 cache size to see things
  rf> slow down. You generally can't fit enough in the L1 cache to see a
  rf> slowdown between L1 and L2 sizes.  My guess is that for a lot of
  rf> small to medium-large lisp calculations the built-in cache behavior
  rf> is quite acceptable, which probably includes most benchmarks that
  rf> predate the era of gigabyte RAM.

I have run my performance benchmarks on 3 UltraSPARC machines: an el
cheapo Blade 100, a Sun-Fire-V240 and a Sun-Fire-280R. All machines
have enough RAM to run the tests without swapping.

   - UltraSPARC-IIe @ 500MHz: 16kB icache, 16kB dcache, 256kB L2 cache
   - UltraSPARC-IIIi @ 1GHz:  32kB icache, 65kB dcache, 1MB L2 cache
   - UltraSPARC-III @ 700MHz: 32kB icache, 65kB dcache, 8MB L2 cache

These experiments suggest that:

  - many of the benchmarks _do_ fit into the L1 instruction cache, but
    for those with a larger code size (eg using the pretty-printer, use of
    complex numbers, or the entire CMUCL compiler) there can be a 30%
    miss rate. 

  - a larger L2 cache doesn't seem to help greatly over 1MB, except
    for benchmarks that are designed to thrash the cache. (It sure
    does increase the price of the processor, though :-)

  - the quality of instruction scheduling seems to be more important
    on UltraSPARC than considerations related to cache: on average
    around 25% of cycles are lost to load stalls.

  
I used the processor's performance counters to measure metrics such as

   - cycles per instruction: these processors are 4-way superscalar so
     have a best-case CPI of 0.25 with a perfect mix of floating
     point and integer operations. Given that few of the benchmarks
     use floating point, ideal CPI is 0.5. The best attained by CMUCL
     is 0.8.

   - rate of instruction-cache misses, and rate of stalls due to
     icache misses

   - rate of L2 cache misses

   - rate of load-use stalls and stalls due to mispredicted branches
  
Unfortunately, in some cases the relevant events differ between US-II
and US-III. It is also possible (though more difficult) to obtain
other measures such as dcache miss/stall rates, but it requires
multiple runs observing different events.


I have appended some raw results. Most of these benchmarks are
included in the cl-bench suite, but I've added a few new tests:

   - COMPILER involves running the CMUCL compiler on a source file

   - WALK-LIST/SEQ and WALK-LIST/MESS are inspired by the code in
     Fateman's cachelisp paper; both walk a list of 2M fixnums,
     sequentially allocated in the former case and merge-sorted in the
     latter case (to spread pointers all over memory)
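
A rough reconstruction of that pair (the real cl-bench code differs
in detail; the trick is that SORT on a list relinks the existing cons
cells, so sorting a sequentially-allocated list by random keys
scatters its spine across memory):

   (defconstant +list-size+ (* 2 1024 1024))

   (defun walk (list)                     ; just chase the spine
     (let ((n 0))
       (dolist (x list n)
         (declare (ignorable x))
         (incf n))))

   (defun walk-list-tests ()
     (let ((seq (loop repeat +list-size+ collect 0))
           (mess (sort (loop repeat +list-size+
                             collect (random most-positive-fixnum))
                       #'<)))
       (time (walk seq))                  ; WALK-LIST/SEQ
       (time (walk mess))))               ; WALK-LIST/MESS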

The columns are CPI, icache miss rate, icache stall rate, ecache
miss rate, and load-use stall rate.  All the rates are percentages of
the total number of cycles in user space, for an execution of between
1 and 3 seconds, with full gc between each test. 
     

=== Dual UltraSPARC III @ 700 MHz, 32kB icache, 65kB dcache, 8MB ecache ===

;; COMPILER                   2.09   [i: 17.2 34.8 ]  [e:  1.1]   6.4
;; LOAD-FASL                  1.51   [i:  8.6 20.3 ]  [e:  0.8]  11.0
;; PERMUTATIONS               1.44   [i:  7.5  1.3 ]  [e:  1.0]   6.8
;; WALK-LIST/SEQ              2.08   [i:  0.4  0.1 ]  [e:  9.9]   4.3
;; WALK-LIST/MESS             6.69   [i:  2.2  0.0 ]  [e: 27.3]   1.8
;; BOYER                      1.44   [i: 10.2  1.1 ]  [e:  0.3]   8.5
;; BROWSE                     1.48   [i:  6.6  3.5 ]  [e:  0.6]   7.5
;; DDERIV                     2.42   [i:  9.0  3.2 ]  [e:  2.2]   6.0
;; DERIV                      2.43   [i:  9.8  3.1 ]  [e:  2.2]   6.2
;; DESTRUCTIVE                1.47   [i: 12.1  1.3 ]  [e:  0.6]   5.5
;; DIV2-TEST-1                3.40   [i:  2.0  2.5 ]  [e:  2.3]   5.2
;; DIV2-TEST-2                2.41   [i:  7.9  2.0 ]  [e:  1.2]   5.2
;; FFT                        1.26   [i:  2.2  2.0 ]  [e:  0.0]   6.2
;; FRPOLY/FIXNUM              1.74   [i: 14.5  1.7 ]  [e:  0.4]   9.1
;; FRPOLY/BIGNUM              1.35   [i:  6.9  3.6 ]  [e:  0.4]  10.4
;; FRPOLY/FLOAT               1.59   [i: 10.0  1.6 ]  [e:  0.3]  13.1
;; PUZZLE                     0.94   [i:  1.7  1.7 ]  [e:  0.0]  20.5
;; TAK                        1.00   [i: 10.7  0.5 ]  [e:  0.0]   9.8
;; CTAK                       1.29   [i:  0.1  0.5 ]  [e:  0.0]   8.1
;; TRTAK                      1.04   [i:  0.1 12.8 ]  [e:  0.0]   3.6
;; TAKL                       1.12   [i:  4.3  0.1 ]  [e:  0.0]   3.0
;; STAK                       1.25   [i:  5.6  0.2 ]  [e:  0.0]  19.9
;; FPRINT/UGLY                1.04   [i:  4.7  3.6 ]  [e:  0.0]  15.2
;; FPRINT/PRETTY              1.47   [i: 15.1 29.6 ]  [e:  0.1]  10.3
;; TRAVERSE                   1.44   [i:  5.6  0.2 ]  [e:  0.2]   7.3
;; TRIANGLE                   0.96   [i:  4.5  0.8 ]  [e:  0.0]  18.2
;; RICHARDS                   1.43   [i: 15.3  0.9 ]  [e:  0.0]  14.1
;; FACTORIAL                  1.91   [i:  6.2 16.0 ]  [e:  1.3]  12.2
;; FIB                        1.57   [i: 22.7  0.6 ]  [e:  0.0]   2.7
;; FIB-RATIO                  0.95   [i:  3.6  1.4 ]  [e:  0.1]   9.0
;; ACKERMANN                  1.73   [i:  9.5  0.3 ]  [e:  0.1]   4.4
;; MANDELBROT                 1.41   [i:  6.5 13.8 ]  [e:  0.6]   7.4
;; MRG32K3A                   0.92   [i:  0.0  0.0 ]  [e:  0.0]  13.1
;; CRC40                      1.08   [i:  6.1  1.6 ]  [e:  0.7]   9.0
;; BIGNUM/ELEM-100-1000       0.96   [i:  3.4  1.6 ]  [e:  0.1]   9.8
;; BIGNUM/ELEM-1000-100       0.89   [i:  0.8  0.3 ]  [e:  0.0]  11.4
;; BIGNUM/ELEM-10000-1        0.95   [i:  0.1  0.1 ]  [e:  0.0]  14.1
;; BIGNUM/PARI-100-10         0.84   [i:  2.4  0.6 ]  [e:  0.0]  10.4
;; BIGNUM/PARI-200-5          0.81   [i:  1.0  0.3 ]  [e:  0.0]  11.4
;; PI-DECIMAL/SMALL           0.87   [i:  1.9  1.1 ]  [e:  0.1]  10.7
;; PI-DECIMAL/BIG             0.85   [i:  1.2  0.6 ]  [e:  0.1]  11.5
;; PI-ATAN                    1.55   [i:  3.6  6.0 ]  [e:  1.2]   8.7
;; PI-RATIOS                  0.93   [i:  3.4  2.8 ]  [e:  0.1]   9.4
;; SLURP-LINES                1.54   [i:  8.6 23.4 ]  [e:  0.7]  10.5
;; HASH-STRINGS               1.01   [i:  5.0  0.7 ]  [e:  0.2]  12.6
;; HASH-INTEGERS              1.14   [i:  4.5  1.1 ]  [e:  0.5]  10.6
;; BOEHM-GC                   1.88   [i: 10.9  2.4 ]  [e:  2.8]   4.6
;; DEFLATE-FILE               1.00   [i:  2.9  2.4 ]  [e:  0.1]  14.7
;; 1D-ARRAYS                  0.95   [i:  5.5  0.9 ]  [e:  0.1]  18.3
;; 2D-ARRAYS                  1.07   [i:  2.5  0.1 ]  [e:  0.4]  22.2
;; 3D-ARRAYS                  0.89   [i:  3.5  0.2 ]  [e:  0.2]  22.7
;; BITVECTORS                 1.27   [i:  0.1  0.5 ]  [e:  1.5]   8.2
;; BENCH-STRINGS              1.26   [i: 13.4  0.3 ]  [e:  0.1]  10.9
;; fill-strings/adjustable    0.90   [i:  6.3  0.3 ]  [e:  0.0]  16.8
;; STRING-CONCAT              1.05   [i:  6.0  0.6 ]  [e:  0.2]  10.7
;; SEARCH-SEQUENCE            1.18   [i:  8.2  1.5 ]  [e:  0.3]  13.1



=== Dual UltraSPARC IIIi @ 1GHz, 32kB icache, 65kB dcache, 1MB ecache ===

;; COMPILER                   2.14   [i: 14.5 38.8 ]  [e:  1.7]   5.8
;; LOAD-FASL                  1.55   [i:  7.6 18.4 ]  [e:  0.8]  11.8
;; PERMUTATIONS               1.30   [i:  7.2  1.2 ]  [e:  1.0]   6.9
;; WALK-LIST/SEQ              2.24   [i:  0.0  0.0 ]  [e: 12.5]   3.8
;; WALK-LIST/MESS            11.77   [i:  2.1  0.0 ]  [e: 67.4]   0.8
;; BOYER                      1.43   [i:  9.5  0.9 ]  [e:  0.3]   8.5
;; BROWSE                     1.53   [i:  4.5  2.8 ]  [e:  0.6]   8.0
;; DDERIV                     2.62   [i:  7.4  2.3 ]  [e:  2.2]   5.4
;; DERIV                      2.68   [i:  8.5  2.5 ]  [e:  2.2]   5.7
;; DESTRUCTIVE                1.51   [i: 12.9  0.9 ]  [e:  0.6]   5.1
;; DIV2-TEST-1                3.60   [i:  1.9  2.0 ]  [e:  2.3]   4.8
;; DIV2-TEST-2                2.47   [i:  7.4  1.6 ]  [e:  1.2]   5.0
;; FFT                        1.22   [i:  1.2  0.7 ]  [e:  0.0]   6.4
;; FRPOLY/FIXNUM              1.72   [i: 13.0  1.5 ]  [e:  0.4]   8.7
;; FRPOLY/BIGNUM              1.38   [i:  6.7  3.3 ]  [e:  0.4]  10.1
;; FRPOLY/FLOAT               1.59   [i:  9.2  1.3 ]  [e:  0.3]  13.1
;; PUZZLE                     1.01   [i:  2.2  1.6 ]  [e:  0.0]  20.4
;; TAK                        0.98   [i:  6.0  0.4 ]  [e:  0.0]   3.5
;; CTAK                       1.24   [i:  7.6  0.5 ]  [e:  0.0]   8.1
;; TRTAK                      0.99   [i:  7.8  2.6 ]  [e:  0.0]   5.6
;; TAKL                       1.07   [i:  3.8  0.1 ]  [e:  0.0]   4.3
;; STAK                       1.37   [i:  4.8  0.1 ]  [e:  0.0]  18.7
;; FPRINT/UGLY                1.03   [i:  5.2  2.7 ]  [e:  0.0]  15.4
;; FPRINT/PRETTY              1.37   [i: 11.5 26.1 ]  [e:  0.1]  10.8
;; TRAVERSE                   1.38   [i:  5.3  0.1 ]  [e:  0.0]   7.2
;; TRIANGLE                   0.98   [i:  5.2  0.0 ]  [e:  0.0]  17.9
;; RICHARDS                   1.51   [i: 13.0  0.7 ]  [e:  0.0]  14.9
;; FACTORIAL                  1.85   [i:  8.3 12.2 ]  [e:  1.3]  13.2
;; FIB                        1.59   [i: 18.6  0.5 ]  [e:  0.0]   2.5
;; FIB-RATIO                  0.97   [i:  5.6  1.2 ]  [e:  0.2]   9.1
;; ACKERMANN                  1.73   [i: 17.8  0.9 ]  [e:  0.0]   1.4
;; MANDELBROT                 1.37   [i:  9.3 11.4 ]  [e:  0.6]   6.7
;; MRG32K3A                   0.90   [i:  0.0  0.0 ]  [e:  0.0]  13.3
;; CRC40                      1.08   [i:  5.7  1.3 ]  [e:  0.7]   9.1
;; BIGNUM/ELEM-100-1000       0.97   [i:  3.5  1.4 ]  [e:  0.2]   9.7
;; BIGNUM/ELEM-1000-100       0.88   [i:  0.8  0.3 ]  [e:  0.0]  11.4
;; BIGNUM/ELEM-10000-1        0.93   [i:  0.1  0.0 ]  [e:  0.0]  13.6
;; BIGNUM/PARI-100-10         0.84   [i:  2.6  0.5 ]  [e:  0.0]  10.3
;; BIGNUM/PARI-200-5          0.81   [i:  1.1  0.2 ]  [e:  0.0]  11.3
;; PI-DECIMAL/SMALL           0.87   [i:  2.4  1.0 ]  [e:  0.1]  10.6
;; PI-DECIMAL/BIG             0.85   [i:  1.2  0.4 ]  [e:  0.0]  11.7
;; PI-ATAN                    1.51   [i:  3.8  5.1 ]  [e:  1.1]   8.8
;; PI-RATIOS                  0.94   [i:  4.4  2.4 ]  [e:  0.1]   9.1
;; SLURP-LINES                1.37   [i:  7.7 14.4 ]  [e:  0.2]  11.8
;; HASH-STRINGS               1.00   [i:  5.1  0.8 ]  [e:  0.2]  12.1
;; HASH-INTEGERS              1.19   [i:  1.8  0.9 ]  [e:  0.8]   9.8
;; BOEHM-GC                   1.89   [i:  9.5  1.8 ]  [e:  2.8]   4.7
;; DEFLATE-FILE               1.02   [i:  5.1  1.6 ]  [e:  0.1]  14.7
;; 1D-ARRAYS                  0.96   [i:  5.7  0.7 ]  [e:  0.0]  19.0
;; 2D-ARRAYS                  1.10   [i:  2.6  0.0 ]  [e:  0.5]  20.9
;; 3D-ARRAYS                  1.02   [i:  3.1  0.0 ]  [e:  0.3]  19.5
;; BITVECTORS                 1.46   [i:  0.1  0.5 ]  [e:  1.6]   7.1
;; BENCH-STRINGS              1.29   [i: 11.9  0.0 ]  [e:  0.2]  10.8
;; fill-strings/adjustable    0.90   [i:  6.0  0.6 ]  [e:  0.0]  16.9
;; STRING-CONCAT              0.93   [i:  6.2  0.4 ]  [e:  0.2]  11.5
;; SEARCH-SEQUENCE            1.17   [i:  5.9  1.2 ]  [e:  0.3]  13.7


=== UltraSPARC-IIe @ 500 MHz, 16kB icache, 16kB dcache, 256kB L2 cache ===

;; COMPILER                   2.79   [i: 15.6 37.4 ]  [e: 59.2]  21.1
;; LOAD-FASL                  1.89   [i:  9.3 24.0 ]  [e: 46.2]  23.8
;; PERMUTATIONS               1.19   [i:  0.7  1.8 ]  [e: 19.0]  25.2
;; WALK-LIST/SEQ              1.65   [i:  0.0  0.1 ]  [e: 57.2]  69.6
;; WALK-LIST/MESS             6.18   [i:  0.0  2.5 ]  [e: 91.3]  84.7
;; BOYER                      1.49   [i:  3.1  6.6 ]  [e: 36.1]  22.0
;; BROWSE                     1.34   [i:  2.2  5.7 ]  [e: 30.5]  25.6
;; DDERIV                     1.76   [i:  1.5  4.0 ]  [e: 31.0]  30.8
;; DERIV                      1.79   [i:  1.5  3.8 ]  [e: 25.9]  24.2
;; DESTRUCTIVE                1.19   [i:  0.3  1.1 ]  [e: 17.2]  11.9
;; DIV2-TEST-1                1.98   [i:  1.6  4.1 ]  [e: 40.8]  55.9
;; DIV2-TEST-2                1.60   [i:  0.9  2.9 ]  [e: 28.1]  43.1
;; FFT                        1.12   [i:  0.1  0.3 ]  [e:  9.0]  14.1
;; FRPOLY/FIXNUM              1.56   [i:  0.9  2.4 ]  [e: 22.0]  18.1
;; FRPOLY/BIGNUM              1.55   [i:  6.6 12.4 ]  [e: 43.2]  15.8
;; FRPOLY/FLOAT               1.40   [i:  1.8  4.0 ]  [e: 30.8]  23.2
;; PUZZLE                     0.96   [i:  0.4  1.8 ]  [e: 22.5]  23.2
;; TAK                        1.17   [i:  0.1  0.5 ]  [e:  1.4]  20.5
;; CTAK                       1.19   [i:  0.2  0.7 ]  [e:  1.1]  18.7
;; TRTAK                      1.17   [i:  0.1  0.4 ]  [e:  1.6]  18.4
;; TAKL                       0.79   [i:  0.0  0.1 ]  [e:  0.9]   2.7
;; STAK                       1.10   [i:  0.1  0.2 ]  [e:  0.2]  20.9
;; FPRINT/UGLY                1.54   [i: 10.2 18.4 ]  [e: 59.8]  15.8
;; FPRINT/PRETTY              1.97   [i: 20.4 30.9 ]  [e: 65.3]  13.2
;; TRAVERSE                   1.69   [i:  0.0  0.2 ]  [e: 19.1]  52.4
;; TRIANGLE                   0.84   [i:  0.1  0.3 ]  [e:  0.9]  16.9
;; RICHARDS                   1.55   [i:  0.0  0.0 ]  [e:  1.0]  19.3
;; FACTORIAL                  2.07   [i: 10.4 15.4 ]  [e: 54.4]   9.6
;; FIB                        1.32   [i:  0.0  0.1 ]  [e:  0.3]   0.9
;; FIB-RATIO                  1.02   [i:  1.0  3.3 ]  [e: 12.4]  11.7
;; ACKERMANN                  1.38   [i:  0.0  0.0 ]  [e: 24.5]  15.2
;; MANDELBROT                 1.66   [i: 12.1 24.6 ]  [e: 54.3]  12.5
;; MRG32K3A                   0.87   [i:  0.0  0.0 ]  [e:  0.1]  23.3
;; CRC40                      1.10   [i:  2.3  5.6 ]  [e: 25.0]  13.6
;; BIGNUM/ELEM-100-1000       0.95   [i:  0.6  2.2 ]  [e:  9.3]  16.7
;; BIGNUM/ELEM-1000-100       0.87   [i:  0.1  0.3 ]  [e:  2.8]  25.8
;; BIGNUM/ELEM-10000-1        1.02   [i:  0.0  0.1 ]  [e: 10.3]  27.0
;; BIGNUM/PARI-100-10         0.82   [i:  0.1  0.5 ]  [e:  2.0]  20.1
;; BIGNUM/PARI-200-5          0.76   [i:  0.0  0.2 ]  [e:  1.1]  26.2
;; PI-DECIMAL/SMALL           0.84   [i:  0.2  1.1 ]  [e:  3.9]  20.6
;; PI-DECIMAL/BIG             0.81   [i:  0.1  0.6 ]  [e:  2.5]  24.4
;; PI-ATAN                    1.52   [i:  3.2  6.9 ]  [e: 27.1]  15.2
;; PI-RATIOS                  0.95   [i:  0.8  3.1 ]  [e: 12.1]  16.1
;; SLURP-LINES                1.47   [i: 10.5 20.7 ]  [e: 54.5]  12.0
;; HASH-STRINGS               1.07   [i:  1.2  3.2 ]  [e: 29.3]  17.1
;; HASH-INTEGERS              1.27   [i:  1.0  1.6 ]  [e: 23.3]  26.3
;; BOEHM-GC                   1.48   [i:  0.8  2.9 ]  [e: 24.6]  17.2
;; DEFLATE-FILE               0.97   [i:  0.2  1.9 ]  [e: 10.3]  15.9
;; 1D-ARRAYS                  0.97   [i:  0.0  0.1 ]  [e:  2.5]  18.6
;; 2D-ARRAYS                  1.12   [i:  0.0  0.1 ]  [e:  6.4]  28.5
;; 3D-ARRAYS                  0.82   [i:  0.0  0.1 ]  [e:  1.1]  20.3
;; BITVECTORS                 3.35   [i:  0.2  0.5 ]  [e: 42.9]  84.1
;; BENCH-STRINGS              1.17   [i:  0.0  0.0 ]  [e:  5.5]  11.3
;; fill-strings/adjustable    0.92   [i:  0.0  0.1 ]  [e:  2.8]  17.0
;; STRING-CONCAT              1.19   [i:  4.1  8.1 ]  [e: 40.1]  17.1
;; SEARCH-SEQUENCE            1.07   [i:  0.5  1.2 ]  [e: 15.5]  19.7

-- 
Eric Marsden                          <URL:http://www.laas.fr/~emarsden/>
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <k74v7xwl.fsf@comcast.net>
Raymond Toy <···@rtp.ericsson.se> writes:

> Shouldn't you profile it with real stuff instead of test code before
> micro-optimizing this? 
>
> Luke Gorrie was saying that in one of his ethernet switch (?) 
> programs, he was consing for every single ethernet packet.  It turned
> out that it didn't really matter, and generational GC took care of the
> consing quite well.

With a good generational GC (and the appropriate tuning), you can cons
garbage at a phenomenal rate for practically nothing.  One program
that I have chews through several gigabytes in a few seconds with very
little GC overhead.  Basically, the generational GC kicks in, notices
that absolutely nothing points at the latest generation, and just
resets the consing frontier to the beginning.

This isn't to say that you can *always* get away with this, nor that
it is *never* worthwhile to handle memory management manually, but in
my experience it is almost never worthwhile worrying about consing
stuff that will be immediately thrown away.

Incidentally, if you are consing stuff that you are *not* throwing
away, then you will want to be concerned about the GC.  A generational
GC will end up promoting the structures through the various
generations unnecessarily.


-- 
~jrm
From: William D Clinger
Subject: Re: reducing number consing
Date: 
Message-ID: <fb74251e.0312191244.73dd35a5@posting.google.com>
Cliff Crawford <·····@cornell.edu> wrote:
> |  Why?  Is GC an issue here?
> 
> I'm worried it will be, because in typical usage it will get called
> 100,000 - 1,000,000 times.

At 32 bytes per call, that's 32 megabytes of short-lived data.
In all likelihood, you're worrying about less than one second
of gc time, probably much less.
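
At the REPL, the estimate is just:

  (* 1000000 32)   ; => 32000000, about 32 megabytes of garbage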

Will
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lhptek1np7.fsf@dodo.bluetail.com>
··········@verizon.net (William D Clinger) writes:

> Cliff Crawford <·····@cornell.edu> wrote:
> > |  Why?  Is GC an issue here?
> > 
> > I'm worried it will be, because in typical usage it will get called
> > 100,000 - 1,000,000 times.
> 
> At 32 bytes per call, that's 32 megabytes of short-lived data.
> In all likelihood, you're worrying about less than one second
> of gc time, probably much less.

While we're being quantitative, in a typical SBCL setup this would
require about three garbage collections (one for each 12MB allocated),
and each would take something like 10ms on a 2Ghz pentium PC if the
data is very short-lived. So "probably less than one second" is
putting it mildly.
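
Multiplying the estimates out:

  ;; ~32 MB of garbage, one nursery collection per ~12 MB allocated,
  ;; ~10 ms per collection:
  (* (ceiling 32 12) 10)   ; => 30, i.e. on the order of 30 ms of GC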

But if the urge to optimise memory management is strong, we're in
luck, because it looks like the garbage collector could be optimised
considerably further for these cases. And what could be higher in
optimisation-glory than hacking the GC?

I must disclaim that probably very few programs would benefit from
this, and one should check with TIME to see. But optimisation should
benefit programs that generate a truly phenomenal amount of garbage,
or programs that don't work well with frequent pauses of ~10ms.

Joe Marshall's nice description was:

  Basically, the generational GC kicks in, notices that absolutely
  nothing points at the latest generation, and just resets the consing
  frontier to the beginning.

But on CMUCL or SBCL, it seems to be more like this:

  The generational GC kicks in, notices that nothing in older
  generations points at the latest generation, scavenges static space
  to update any pointers into the latest generation, and then just
  sets the consing frontier to the beginning.

That is to say, the GC doesn't know whether anything in "static" space
(where non-movable objects can be allocated) has been modified, so
during each collection it "scavenges" that memory to update pointers
to moved objects. The other dynamic-space generations have a
"write-barrier" to keep track of what has and hasn't been modified,
but the static space does not.

Static space appears to typically be around 4MB, and a significant
amount of time seems to be spent scavenging it (based on profiling).

Ideas have been floated about fixing this, for instance moving most of
the objects in static space into a "tenured" generation of dynamic
space (where it gets the write-barrier) that doesn't get
collected. Whoever rolls up their sleeves and hacks it will have very
good karma indeed :-)

More details can be found in the cmucl-imp mailing list archives.

It also appears that on some systems (i.e. my laptop) the garbage
collector causes a great deal of kernel work. Smarter use of system
calls may fix this, e.g. coalescing mprotect()s of adjacent pages. You
can see if your system suffers this problem by running this in the
top-level:

  (defun foo () (make-array 1024))
  (compile 'foo)
  (time (dotimes (i (expt 10 6)) (foo)))

On my laptop this reports that most of the time was spent in the
system (i.e. kernel), but on my desktop machine it doesn't. Both are
running Linux/x86, but different kernels and different CMUCL versions.

Cheers,
Luke
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <65gc9yhi.fsf@comcast.net>
Luke Gorrie <····@bluetail.com> writes:

> It also appears that on some systems (i.e. my laptop) the garbage
> collector causes a great deal of kernel work. Smarter use of system
> calls may fix this, e.g. coalescing mprotect()s of adjacent pages. 

Some experiments have shown that checking the write-barrier in
software by doing a conditional branch on every write can outperform
hardware checking based on mprotect. 
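
In software the barrier amounts to something like this (a toy model:
a real system derives the card index from the object's address, for
which a vector index stands in here):

  ;; Card-marking write barrier.  Every write pays one extra store to
  ;; flag the card holding the written word; at GC time only flagged
  ;; cards are scanned for old->young pointers, instead of taking
  ;; protection traps on whole pages.
  (defconstant +card-words+ 128)
  (defvar *heap* (make-array 4096 :initial-element nil))
  (defvar *cards* (make-array (ceiling 4096 +card-words+)
                              :element-type 'bit :initial-element 0))

  (defun barrier-write (index value)
    (setf (sbit *cards* (floor index +card-words+)) 1) ; mark card dirty
    (setf (svref *heap* index) value))

  (defun dirty-cards ()                   ; what the GC would scan
    (loop for i below (length *cards*)
          when (= 1 (sbit *cards* i)) collect i))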

-- 
~jrm
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lh7k0s1h7w.fsf@dodo.bluetail.com>
Joe Marshall <·············@comcast.net> writes:

> Some experiments have shown that checking the write-barrier in
> software by doing a conditional branch on every write can outperform
> hardware checking based on mprotect. 

Interesting. Do you have any more details?

In particular I wonder what the bottleneck was with hardware checking.

-Luke
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <vfoc8bng.fsf@comcast.net>
Luke Gorrie <····@bluetail.com> writes:

> Joe Marshall <·············@comcast.net> writes:
>
>> Some experiments have shown that checking the write-barrier in
>> software by doing a conditional branch on every write can outperform
>> hardware checking based on mprotect. 
>
> Interesting. Do you have any more details?

@inproceedings{ hosking92comparative,
    author = "Antony L. Hosking and J. Eliot B. Moss and Darko Stefanovic",
    title = "A comparative performance evaluation of write barrier implementations",
    booktitle = "Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications ({OOPSLA})",
    journal = "SIGPLAN Notices",
    volume = "27",
    number = "10",
    publisher = "ACM Press",
    address = "New York, NY",
    editor = "Andreas Paepcke",
    pages = "92--109",
    year = "1992",
    url = "citeseer.nj.nec.com/hosking92comparative.html" }

> In particular I wonder what the bottleneck was with hardware checking.

Really slow interrupt handling.

-- 
~jrm
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lh7k0roqtx.fsf@dodo.bluetail.com>
Joe Marshall <·············@comcast.net> writes:

> Luke Gorrie <····@bluetail.com> writes:
> 
> > In particular I wonder what the bottleneck was with hardware checking.
> 
> Really slow interrupt handling.

Raymond Toy and I have been poking around at this in CMUCL.

For my part, I have reached the following shocking conclusions:

  Performance analysis is quite hard.
  Microbenchmarks are not very useful.
  Profilers are really wonderful.

This is based on "carefully" analysing the performance of the
following program:

  (defun foo (x y) (cons x y))
  (dotimes (i N) (foo 1 2))

Hopefully some tidbits of interesting, if not useful, CMUCL
information come from this:

  For purely ultra-short-lived data, allocation is much more costly
  than garbage collection.

  The majority of time was spent in the kernel page-fault handler, but
  this has nothing to do with the write-barrier. At the end of garbage
  collection, the collector "zeros out" the oldspace by
  munmap/mmap'ing it. The kernel implements this lazily by assigning
  each page as a copy-on-write version of a reserved page of
  zeroes. The first time you write to each of these pages (by
  allocating a new cons) the kernel internally faults and zeroes a
  page of real memory for you.

  If you compile CMUCL from source, make sure you build the runtime
  system with -O3 (which may not happen by default). For me this made
  garbage collection almost three times faster.

Of course we all know that profilers are wonderful, but Oprofile for
Linux is really really wonderful. It is convenient to use (no
instrumentation), profiles everything on your computer at once (all
programs, shared libraries, kernel, device drivers), and causes no
noticeable overhead. If you ask it, "What's my computer doing while
CMUCL compiles itself?", it will promptly reply something like:

  samples  %        app name                 symbol name
  194960   73.5565  lisp.core                (no symbols)
  15843     5.9774  lisp                     scavenge
  9578      3.6137  lisp                     from_space_p
  8743      3.2986  vmlinux                  default_idle
  5810      2.1921  vmlinux                  do_anonymous_page
  4814      1.8163  cc1                      (no symbols)
  2415      0.9112  vmlinux                  do_page_fault

Okay, so it's a pity it can't find symbol information in the Lisp
image or stripped C programs. However, it is perfectly happy to
annotate the code (source or disassembly) of the kernel's
`do_anonymous_page' to let you see that it's spending its time writing
zeros into memory, for example.

Finally, though I've been very well brought up not to micro-optimize,
I couldn't help comparing the foo program to this bar program:

  (defvar my-cons (cons nil nil))
  (defun bar (x y) (setf (car my-cons) x) (setf (cdr my-cons) y) my-cons)
  (dotimes (i N) (bar 1 2))

I didn't write down my performance prediction in a sealed envelope,
but I think I was expecting it to be something like 20-50 times
faster. I was surprised to find it more like 500 times. So although
I'm not about to run and rewrite all my programs without using CONS, I
won't be quite so quick to scoff the next time someone tells me that
he really had to optimize the memory management in his program to get
the performance he needed.

I'm really a Watson in search of a Holmes in these matters, so I think
I'll leave it at that for now.

Tallyho,
Luke
From: Pascal Bourguignon
Subject: Re: reducing number consing
Date: 
Message-ID: <87oeu3hnye.fsf@thalassa.informatimago.com>
Luke Gorrie <····@bluetail.com> writes:
>   For purely ultra-short-lived data, allocation is much more costly
>   than garbage collection.
> 
>   The majority of time was spent in the kernel page-fault handler, but
>   this has nothing to do with the write-barrier. At the end of garbage
>   collection, the collector "zeros out" the oldspace by
>   munmap/mmap'ing it. 

So, here you have the bug in CMUCL!  
It should use memset(garbage,0,size) instead.


-- 
__Pascal_Bourguignon__                              .  *   * . * .* .
http://www.informatimago.com/                        .   *   .   .*
There is no worse tyranny than to force             * .  . /\  (   . *
a man to pay for what he does not                    . .  / .\   . * .
want merely because you think it                    .*.  / *  \  . .
would be good for him. -- Robert Heinlein             . /*   o \     .
http://www.theadvocates.org/                        *   '''||'''   .
SCO Spam-magnet: ··········@sco.com                 ******************
From: Luke Gorrie
Subject: Embarrassing correction (Was: reducing number consing)
Date: 
Message-ID: <lhy8t7n7zh.fsf_-_@dodo.bluetail.com>
Luke Gorrie <····@bluetail.com> writes:

> Finally, though I've been very well brought up not to micro-optimize,
> I couldn't help comparing the foo program to this bar program:
> 
>   (defvar my-cons (cons nil nil))
>   (defun bar (x y) (setf (car my-cons) x) (setf (cdr my-cons) y) my-cons)
>   (dotimes (i N) (bar 1 2))
> 
> I didn't write down my performance prediction in a sealed envelope,
> but I think I was expecting it to be something like 20-50 times
> faster. I was surprised to find it more like 500 times.

*Ahem*. :-)

I'm pleased to report that I botched this _completely_. You might be
well advised to ignore all that I've said, as my CMUCL source tree
appears to be rather badly broken.

I will refrain from reporting a "correction", lest I botch it up too
somehow. It's easy to measure for yourself.

Cons and be merry!

Cheers,
Luke
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <brq38o0w.fsf@comcast.net>
Luke Gorrie <····@bluetail.com> writes:

> Joe Marshall <·············@comcast.net> writes:
>
>> Luke Gorrie <····@bluetail.com> writes:
>> 
>> > In particular I wonder what the bottleneck was with hardware checking.
>> 
>> Really slow interrupt handling.
>
> Raymond Toy and I have been poking around at this in CMUCL.
>
> For my part, I have reached the following shocking conclusions:
>
>   Performance analysis is quite hard.
>   Microbenchmarks are not very useful.
>   Profilers are really wonderful.

I'll concur with these.

> Hopefully some tidbits of interesting, if not useful, CMUCL
> information comes from this:
>
>   For purely ultra-short-lived data, allocation is much more costly
>   than garbage collection.

I can believe this.  I don't know the details of CMUCL, but other
systems have a lot of overhead in allocation.

  Allocation from an arena shared across threads needs to be locked.
  The solution here is to give each thread its own top-level
  allocation arena.

  Many (or most) allocators check the allocation limit at each
  allocation (see the sketch after this list).  One possible strategy
  is to use a separate thread to
  check the remaining room.  Of course this thread had better be very
  high priority.  Another is to protect a page at the end of the
  arena.

  Many systems rely on zeroed pages for GC safety.  Although this
  avoids problems with the GC discovering bad data between allocating
  the cell and filling it, the cell ends up being written twice.

  Some systems avoid the zeroing problem by having special arenas for
  certain kinds of ultra-short-lived data like floats and bignums.

  Finally, the few instructions used for allocation tend to really
  exercise the memory bus.  Hand-scheduling the instructions can
  produce a noticeable benefit. 
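
The fast path being discussed is roughly this (a toy model; a real
allocator keeps the pointer and limit in registers and does all of
it in a handful of instructions):

  ;; Pointer-bump allocation with an allocation-limit check.  The
  ;; fast path is an add and a compare; the slow path (just an
  ;; error here) would call into the GC or fetch a fresh arena.
  (defvar *free-pointer* 0)
  (defvar *alloc-limit* 8192)

  (defun allocate-words (nwords)
    (let ((new (+ *free-pointer* nwords)))
      (when (> new *alloc-limit*)         ; the per-allocation check
        (error "arena full: a real system would GC here"))
      (prog1 *free-pointer*
        (setf *free-pointer* new))))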

>   The majority of time was spent in the kernel page-fault handler, but
>   this has nothing to do with the write-barrier. At the end of garbage
>   collection, the collector "zeros out" the oldspace by
>   munmap/mmap'ing it. The kernel implements this lazily by assigning
>   each page as a copy-on-write version of a reserved page of
>   zeroes. The first time you write to each of these pages (by
> allocating a new cons) the kernel internally faults and zeroes a
>   page of real memory for you.

If the page really needs to be zeroed, it'd probably be better to do
so in the process rather than in the kernel.

> So although
> I'm not about to run and rewrite all my programs without using CONS, I
> won't be quite so quick to scoff the next time someone tells me that
> he really had to optimize the memory management in his program to get
> the performance he needed.

Sometimes you need to optimize memory management.  But it is far
better to check the performance first.

-- 
~jrm
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lhbrq1zd1p.fsf@dodo.bluetail.com>
Hi Joe,

Thanks for the references.

> If the page really needs to be zeroed, it'd probably be better to do
> so in the process rather than in the kernel.

One advantage of doing it in the kernel is that it happens lazily
rather than in the GC, so GCs don't "stop the world" for as long.

The performance difference doesn't seem very significant on CMUCL with
Linux/x86. People interested in playing around can use this snippet to
make the GC zero out memory with the CPU:

  (alien:def-alien-variable ("gencgc_unmap_zero" gencgc-unmap-zero) c-call::int)
  (setq gencgc-unmap-zero 0)

> Sometimes you need to optimize memory management.  But it is far
> better to check the performance first.

And to check it carefully. I'm amazed that after getting into
"micro-optimize" mode - disassembling the kernel, thinking about
protection faults, etc - everything started to feel so expensive that
I swallowed benchmark results that were off by at least two orders of
magnitude. It is all too easy to disengage one's brain in such
situations, at least for me.

The results I posted (and have tried to retract) drastically
exaggerate the cost of zeroing memory. I had, uh, somehow broken my
CMUCL such that calling CONS would allocate several kilobytes of
memory, thus triggering very frequent collections (and
re-initializations) of mostly-unused memory.

The reality seems to be that zero'ing and GC'ing is quite cheap
compared with actually doing something with all of that memory
(e.g. poking CARs and CDRs into it).  Emphasis on "seems" of course. :-)

Cheers,
Luke
From: Michael Livshin
Subject: Re: reducing number consing
Date: 
Message-ID: <s3r7z0e3c8.fsf@cmm.kakpryg.net.cmm>
Joe Marshall <·············@comcast.net> writes:

> Luke Gorrie <····@bluetail.com> writes:
>
> Some experiments have shown that checking the write-barrier in
> software by doing a conditional branch on every write can outperform
> hardware checking based on mprotect. 

that's very system-dependent.  conventional wisdom holds, for
instance, that on Linux signals (and context switches in general) are
very fast, so there the mprotect-based approach can be a win.  I'm not
sure how much one can trust conventional wisdom here, though.  are
signals as fast on SMP systems as they are on single-CPU ones?  is
this still true for reasonably current kernel/libc versions?  I don't
know.

what were the experiments you talk about done on, if it's not a secret?

-- 
it takes more than not to remember how you did it the last time to be
innovative.                                            -- Erik Naggum
From: Duane Rettig
Subject: Re: reducing number consing
Date: 
Message-ID: <4k74sxh6d.fsf@franz.com>
Michael Livshin <······@cmm.kakpryg.net> writes:

> Joe Marshall <·············@comcast.net> writes:
> 
> > Luke Gorrie <····@bluetail.com> writes:
> >
> > Some experiments have shown that checking the write-barrier in
> > software by doing a conditional branch on every write can outperform
> > hardware checking based on mprotect. 
> 
> that's very system-dependent.  conventional wisdom holds, for
> instance, that on Linux signals (and context switches in general) are
> very fast, so there the mprotect-based approach can be a win.  I'm not
> sure how much one can trust conventional wisdom here, though.  are
> signals as fast on SMP systems as they are on single-CPU ones?  is
> this still true for reasonably current kernel/libc versions?  I don't
> know.

It depends on what you call "fast".  Is 100 - 200 cycles fast?
Probably, because it is certainly faster than you can blink.
But if you compare that with the 5-10 cycles that a software
test might take, then we're dealing with molasses...

Why does trap handling take so much longer than software handling?
Well, a trap handler (especially one which executes user code
in order to determine the behavior) must be general; it must
set up protections so that the user trap handling cannot destroy
the system by bug or by malfeasance, and it must save as much of
the current context (including any used registers) as is necessary
in order to restore the state of the program after the trap.
On the other hand, a software barrier can be optimized by the
compiler to save only those caller-saves registers that are
thus volatile, and since there is no context switch, there is
no protection change.  The software can get in and out quickly.

On the third hand, I have often raised some ruckus in comp.arch
to ask them there for fast user-level trap handling as the 
best way for general-purpose computer architectures to support
garbage-collected languages.  If such an architecture (plus
supporting kernel) were to ever be available in a general
purpose operating system, then even if such a conditional
trap were to take 30-50 cycles, it would still be worthwhile
to pursue it to replace the 5-10 cycle software barrier, since
software barriers tend to never be zero-cost, but traps, when
not taken, do tend to be zero cost, so the challenge is to find
the best algorithm that minimizes the number of times a trap is
actually taken.


-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Rob Warnock
Subject: Re: reducing number consing
Date: 
Message-ID: <iIWdnSGgNeaPSnuiRVn-sw@speakeasy.net>
Duane Rettig  <·····@franz.com> wrote:
+---------------
| On the third hand, I have often raised some ruckus in comp.arch
| to ask them there for fast user-level trap handling as the 
| best way for general-purpose computer architectures to support
| garbage-collected languages.  If such an architecture (plus
| supporting kernel) were to ever be available in a general
| purpose operating system, then even if such a conditional
| trap were to take 30-50 cycles, it would still be worthwhile
| to pursue it to replace the 5-10 cycle software barrier, since
| software barriers tend to never be zero-cost, but traps, when
| not taken, do tend to be zero cost, so the challenge is to find
| the best algorithm that minimizes the number of times a trap is
| actually taken.
+---------------

But notice that the Hosking, et al, paper I cited in a parallel reply
made the point that it's not necessarily the cost of the trap that's
the killer, but the large size of a hardware page [4 KB or larger]
compared to the optimum card size [256-512 bytes]. With a page-trapping
system, at GC time you have to scan *all* of each page that got stored
into, even if only one word got stored per page.

Unfortunately, as memory sizes grow, pressure on TLB sizes will only
push hardware page sizes higher & higher. As I noted in the other
reply, some (non-PC) servers already default to 16 KB pages, and in
some applications use *much* larger hardware page sizes -- *megabytes*,
even!! -- to avoid TLB thrashing.[1] In those situations, software write
barriers once again become a big win.


-Rob

[1] SGI's Origin systems, running Irix on MIPS, support up to *16 MB*
    pages for large HPC apps. This is aided enormously by the fact that
    the MIPS TLB supports multiple page sizes in a single process, so
    that not all shared-library pages have to be huge just because the
    app is using large pages in some other part of its address space.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Raymond Toy
Subject: Re: reducing number consing
Date: 
Message-ID: <4nwu8o7vwj.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Rob" == Rob Warnock <····@rpw3.org> writes:

    Rob> Duane Rettig  <·····@franz.com> wrote:
    Rob> +---------------
    Rob> | On the third hand, I have often raised some ruckus in comp.arch
    Rob> | to ask them there for fast user-level trap handling as the 
    Rob> | best way for general-purpose computer architectures to support
    Rob> | garbage-collected languages.  If such an architecture (plus
    Rob> | supporting kernel) were to ever be available in a general
    Rob> | purpose operating system, then even if such a conditional
    Rob> | trap were to take 30-50 cycles, it would still be worthwhile
    Rob> | to pursue it to replace the 5-10 cycle software barrier, since
    Rob> | software barriers tend to never be zero-cost, but traps, when
    Rob> | not taken, do tend to be zero cost, so the challenge is to find
    Rob> | the best algorithm that minimizes the number of times a trap is
    Rob> | actually taken.
    Rob> +---------------

    Rob> But notice that the Hosking, et al, paper I cited in a parallel reply
    Rob> made the point that it's not necessarily the cost of the trap that's
    Rob> the killer, but the large size of a hardware page [4 KB or larger]
    Rob> compared to the optimum card size [256-512 bytes]. With a page-trapping
    Rob> system, at GC time you have to scan *all* of each page that got stored
    Rob> into, even if only one word got stored per page.

Just a side note on an experiment I just did with CMUCL.  Generational
GC was recently ported to cmucl for the Sparc architecture.  I ran
Eric Marsden's CL benchmark suite on both the non-gencgc version and
the gencgc version.  Most benchmarks were roughly equal, but a few
showed that gencgc was 3 times slower.

As an experiment, I changed the page size used by the gencgc
algorithm.  It was 8K, but I changed it to 64K.  According to the
benchmarks, gencgc was much improved, and only a little (25%?) slower
than the non-gencgc version.

Note, however, that this gencgc implementation works by doing a kernel
trap every once in a while to allocate memory when the inline
allocator runs out of space.  By changing the page size from 8K to
64K, the kernel trap would get called much less often.  Also, the
inline allocator in the gencgc is 2-3 times more expensive than the
non-gencgc allocator, which takes about 2-3 instructions to allocate
memory.

Anyway, just a note on some simple experiments I did.

Ray
From: Luke Gorrie
Subject: Re: reducing number consing
Date: 
Message-ID: <lh1xqxx7fn.fsf@dodo.bluetail.com>
····@rpw3.org (Rob Warnock) writes:

> But notice that the Hosking, et al, paper I cited in a parallel reply
> made the point that it's not necessarily the cost of the trap that's
> the killer, but the large size of a hardware page [4 KB or larger]
> compared to the optimum card size [256-512 bytes]. With a page-trapping
> system, at GC time you have to scan *all* of each page that got stored
> into, even if only one word got stored per page.

What seems most important to me is: how do I know when it matters?

Really, I already know that for me it matters so rarely that it's not
even funny. Perhaps once every second year I write a program where +/-
10% performance would be interesting. Even when I do, the "inner-loop"
is usually in the operating system somewhere.

What I'd like is to compile a "Read Before Optimising" sanity
check-list, just to put things in perspective when
premature-optimisation fever sets in. I'm hereby soliciting pointed
slogans :-)

Here are a few ideas to start with:

  Worried about taking write-barrier interrupts when you modify old
  variables?
  Remember, your Linux box is already context-switching on
  timer-interrupts one thousand times per second. Why do you think it
  won't be lost in the noise?

  Worried about dirtying some pages that will need to be scavenged?
  Remember, your MP3 player is decompressing and then shipping
  200KB/second to your sound card. When's the last time you turned
  that off to make your computer run faster?

  Worried that consing will slow you down?
  Remember, your generational garbage collector was written specially
  so that you could create as much garbage as you like. Exactly how
  much speed are you expecting to gain, and how do you plan to measure
  it?

Any takers? I think a nicely cl-typeset quick-reference card could be
just the thing for guys like me.

I'd be interested to read a complementary card written by people who
really do write every-cycle-counts programs all day long too. Is "turn
off your MP3 player" the first law of serious optimization? ;-)

Cheers,
Luke
From: Larry Clapp
Subject: Re: reducing number consing
Date: 
Message-ID: <slrnbuk5d9.81p.larry@theclapp.ddts.net>
In article <··············@dodo.bluetail.com>, Luke Gorrie wrote:
> What I'd like is to compile a "Read Before Optimising" sanity
> check-list, just to put things in perspective when
> premature-optimisation fever sets in. I'm hereby soliciting pointed
> slogans :-)

Though for Perl, I've found this one fairly helpful:

http://magnonel.guild.net/~schwern/talks/How_To_Be_Lazy/full_slides/rules_of_optimization.html

My favorite rules of optimization (excerpted from above):
Rule #1: Don't do it.
Rule #2: Don't do it yet.

-- 
Larry Clapp / ·····@theclapp.org
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <hdzpal1r.fsf@comcast.net>
Larry Clapp <·····@theclapp.org> writes:

> In article <··············@dodo.bluetail.com>, Luke Gorrie wrote:
>> What I'd like is to compile a "Read Before Optimising" sanity
>> check-list, just to put things in perspective when
>> premature-optimisation fever sets in. I'm hereby soliciting pointed
>> slogans :-)
>
> Though for Perl, I've found this one fairly helpful:
>
> http://magnonel.guild.net/~schwern/talks/How_To_Be_Lazy/full_slides/rules_of_optimization.html
>
> My favorite rules of optimization (excerpted from above):
> Rule #1: Don't do it.
> Rule #2: Don't do it yet.

Rule #3:  Whatever you do, don't do it in Perl.


-- 
~jrm
From: Joe Marshall
Subject: Re: reducing number consing
Date: 
Message-ID: <r7z08bgk.fsf@comcast.net>
Michael Livshin <······@cmm.kakpryg.net> writes:

> Joe Marshall <·············@comcast.net> writes:
>>
>> Some experiments have shown that checking the write-barrier in
>> software by doing a conditional branch on every write can outperform
>> hardware checking based on mprotect. 
>
> that's very system-dependent.  conventional wisdom holds, for
> instance, that on Linux signals (and context switches in general) are
> very fast, so there the mprotect-based approach can be a win.  I'm not
> sure how much one can trust conventional wisdom here, though.  are
> signals as fast on SMP systems as they are on single-CPU ones?  is
> this still true for reasonably current kernel/libc versions?  I don't
> know.
>
> what were the experiments you talk about done on, if it's not a secret?

@inproceedings{ hosking92comparative,
    author = "Antony L. Hosking and J. Eliot B. Moss and Darko Stefanovic",
    title = "A comparative performance evaluation of write barrier implementations",
    booktitle = "Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications ({OOPSLA})",
    journal = "SIGPLAN Notices",
    volume = "27",
    number = "10",
    publisher = "ACM Press",
    address = "New York, NY",
    editor = "Andreas Paepcke",
    pages = "92--109",
    year = "1992",
    url = "citeseer.nj.nec.com/hosking92comparative.html" }

and

@techreport{ zorn90barrier,
    author = "Benjamin Zorn",
    title = "Barrier Methods for Garbage Collection",
    institution = "University of Colorado at Boulder",
    number = "CU-CS-494-90",
    year = "1990",
    url = "citeseer.nj.nec.com/zorn90barrier.html" }

It would be interesting to see more recent numbers, though.
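
For concreteness, the mprotect-based scheme those papers compare
against looks roughly like this POSIX sketch (mine, not code from
either paper; record_dirty_page is a hypothetical GC hook, and
old_gen must be page-aligned):

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    extern void record_dirty_page(void *page);   /* hypothetical */

    /* The first store to a protected page lands here: remember the
       page for the next GC scan, then unprotect it so the faulting
       store can be retried. */
    static void barrier_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *page = (void *)((uintptr_t)si->si_addr
                              & ~(uintptr_t)(PAGE_SIZE - 1));
        record_dirty_page(page);
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    /* Map the old generation read-only so stores into it trap. */
    void install_write_barrier(void *old_gen, size_t len)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = barrier_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(old_gen, len, PROT_READ);
    }

Stores to already-unprotected pages then cost nothing, so the scheme
wins or loses purely on trap frequency and page granularity.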


-- 
~jrm
From: William D Clinger
Subject: Re: reducing number consing
Date: 
Message-ID: <fb74251e.0312200728.63d752ca@posting.google.com>
Joe Marshall wrote:
> It would be interesting to see more recent numbers, though.

I'm too lazy to type in a bunch of references, but you can follow
the references in this paper, which itself contains a bunch of
numbers:

David Detlefs, Ross Knippel, William D Clinger, and Matthias Jacob.
Concurrent Remembered Set Refinement in Generational Garbage Collection.
In the Proceedings of the 2002 USENIX Java VM Research and Technology
Symposium, August 2002.  Online as PDF or HTML:
http://research.sun.com/jtech/pubs/02-clog.pdf
http://research.sun.com/jtech/pubs/02-clog-html/clog.html

Will
From: Rob Warnock
Subject: Re: reducing number consing
Date: 
Message-ID: <1oydnS6dGuAwTnuiRVn-tA@speakeasy.net>
Joe Marshall  <·············@comcast.net> wrote:
+---------------
| Luke Gorrie <····@bluetail.com> writes:
| > It also appears that on some systems (i.e. my laptop) the garbage
| > collector causes a great deal of kernel work. Smarter use of system
| > calls may fix this, e.g. coalescing mprotect()s of adjacent pages. 
| 
| Some experiments have shown that checking the write-barrier in
| software by doing a conditional branch on every write can outperform
| hardware checking based on mprotect. 
+---------------

I remember reading some papers about that a while back. One in particular,
<URL:http://citeseer.nj.nec.com/hosking92comparative.html>, examined
various kinds of write barriers and concluded that a software write
barrier using card marking (with fairly small cards, 256-512 bytes)
out-performed kernel/VM page-based hardware write barriers:

	The page trapping scheme performed poorly in comparison to
	card marking. Interestingly, this does not appear to be due to
	the overhead of fielding page traps, since that is included in
	running time, which was not significantly higher (and often lower)
	than in the card marking schemes. Rather, it is because pages
	are too large a granule so they miss the optimum card size.

That is, a single write means that the whole page gets "dirtied"; you have
to scan the entire page at the next GC. Whereas with a card-marking system
you only have to scan the (presumably much smaller) card that was dirtied.
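
In code, the card-marking barrier being measured amounts to roughly
the following (my sketch, not the paper's; the names are made up, and
a 512-byte card size is assumed):

    #include <stdint.h>

    #define CARD_BITS  9                    /* 512-byte cards    */
    #define HEAP_BYTES (64 * 1024 * 1024)   /* assumed heap size */

    static char          *heap_base;
    static unsigned char  card_table[HEAP_BYTES >> CARD_BITS];

    /* Inlined at every pointer store. */
    static inline void store_with_barrier(void **slot, void *value)
    {
        *slot = value;
        card_table[((uintptr_t)slot - (uintptr_t)heap_base)
                   >> CARD_BITS] = 1;        /* dirty one card */
    }

At GC time the collector walks card_table and scans only the 512-byte
cards whose bytes are set, rather than whole 4KB-or-larger pages.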

	We can summarize the conclusions as follows: a card size of 256
	or 512 bytes appears optimal for card marking on this hardware;
	page trapping was surprisingly effective, but is not the best
	scheme because its granularity is too large; and remembered sets[1]
	are best overall.


-Rob

[1] "Remembered sets" are yet another way of doing software write barriers,
    providing an even more precise record of the significant stores:

	...associates a remembered set with each generation, recording
	the objects or locations in older generations that may contain
	pointers into that generation. Any pointer store that creates
	a reference from an older generation to a younger generation
	is recorded in the remembered set for the younger generation.
	At scavenge time the remembered sets for the generations being
	scavenged include the heap root set for the scavenge.
	...
	Our remembered sets are implemented as circular hash tables
	using linear hashing. ...
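
    In code, the remembered-set barrier amounts to roughly the sketch
    below (mine, not the paper's; the helpers are hypothetical, and
    smaller generation numbers are assumed to mean younger):

        typedef struct remset remset;            /* circular hash table */
        extern int     generation_of(void *p);          /* hypothetical */
        extern remset *remset_for_generation(int gen);  /* hypothetical */
        extern void    remset_insert(remset *rs, void **slot);

        /* Record only stores that create an old->young reference;
           inserting the same slot twice just hits the same hash entry. */
        void store_with_remset_barrier(void **slot, void *value)
        {
            *slot = value;
            if (generation_of((void *)slot) > generation_of(value))
                remset_insert(remset_for_generation(generation_of(value)),
                              slot);
        }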

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: reducing number consing
Date: 
Message-ID: <pan.2003.12.22.15.19.52.446908@knm.org.pl>
On Mon, 22 Dec 2003 05:21:17 -0600, Rob Warnock wrote:

> 	...associates a remembered set with each generation, recording
> 	the objects or locations in older generations that may contain
> 	pointers into that generation.

Is it better to store objects or locations? Neither is *obviously* better:
storing objects avoids storing the same object multiple times when it's
written to sequentially (if we avoid storing the same pointer twice in a
row), OTOH storing locations needs to scan fewer objects during GC. But
maybe one or the other has been measured to be usually better?

And how many previous stores is it worth comparing against? Qish even
three recent addresses (and it stores objects, not locations).
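
For what it's worth, the one-entry version of that filter is just a
compare before the insert; a sketch in C (hypothetical names):

    extern void remset_insert_object(void *obj);  /* hypothetical */

    /* Skip the insert when the same object was just recorded.
       Qish generalizes this, comparing the last three addresses. */
    static void *last_recorded;

    void remember_object(void *obj)
    {
        if (obj != last_recorded) {
            remset_insert_object(obj);
            last_recorded = obj;
        }
    }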

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: William D Clinger
Subject: Re: reducing number consing
Date: 
Message-ID: <fb74251e.0312221346.e548bba@posting.google.com>
Concerning the write barrier, Marcin 'Qrczak' Kowalczyk asked:
> Is it better to store objects or locations? Neither is *obviously* better:
> storing objects avoids storing the same object multiple times when it's
> written to sequentially (if we avoid storing the same pointer twice in a
> row), OTOH storing locations needs to scan fewer objects during GC.

With a card table barrier, storing objects is likely to take two machine
instructions; storing locations is likely to take three.  It would cost
a lot more to avoid storing the same object or location twice in a row,
but one of the advantages of card tables is that storing the same object
or location twice makes no difference.
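
A sketch of where those counts come from (hypothetical C, with the
card table assumed reachable at a fixed base; exact counts vary with
the instruction set):

    #include <stdint.h>

    #define CARD_BITS 9
    extern unsigned char card_table[];   /* assumed fixed base */

    #define DIRTY(addr) \
        (card_table[(uintptr_t)(addr) >> CARD_BITS] = 1)

    typedef struct { void *slots[8]; } object;

    void store_object_grain(object *obj, int i, void *val)
    {
        obj->slots[i] = val;
        DIRTY(obj);             /* shift + byte store: ~2 instructions */
    }

    void store_location_grain(object *obj, int i, void *val)
    {
        obj->slots[i] = val;
        DIRTY(&obj->slots[i]);  /* add + shift + store: ~3 instructions */
    }

With object granularity the object pointer is already in a register;
location granularity first has to form the address of the slot.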

> But maybe one or the other has been measured to be usually better?

There are too many interacting variables here for anyone to draw general
conclusions on which is better.  This issue interacts with the barrier
technology, the gc technology, the application, and the hardware.

Besides, the two answers you consider are not mutually exclusive.  Some
assignments can use a write barrier that stores objects while others use
a barrier that stores locations or approximate locations (e.g. cards).

Will