From: Paul Wilson
Subject: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <2i1um4$ata@jive.cs.utexas.edu>
[ I'm crossposting this to comp.arch
  and comp.lang.lisp, because it fits
  right in with a thread there... PRW ]

>- As Scott Nettles mentioned, however, the surprising fact is that write
>misses don't really hurt on some machines in some circumstances.  In
>particular, Phil Koopman and I (in our TOPLAS paper in April 1991)
>demonstrated that a machine with a write-allocate strategy and sub-block
>placement reduces the write-miss penalty to essentially zero.
>Abstractly, the writes can be serviced in parallel.  At POPL94 last week,
>Amer Diwan and David Tarditi studied this more deeply for SML/NJ, and in
>addition to the above features also showed that page-mode writes are
>useful.

I'm curious how well this scales.  Will we reach a point with fast
CPUs where the sheer bandwidth of writes is too high?  Will write
bandwidths track CPU speeds, despite increasing relative latencies?
(I'd expect them not to---but maybe they'll be close enough at the
levels of the memory hierarchy that are of interest, i.e., the
interface between a cache that's big enough to hold the youngest
generation of a generational GC---some large fraction of a megabyte,
typically---and the next faster cache, which isn't.  But I'm no
architecture guru.)
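
(Back-of-the-envelope, with made-up numbers: a 100-MIPS machine that
allocates one 4-byte word every 8 instructions generates about 50 MB/s
of initializing writes; a 1000-MIPS machine with the same allocation
rate per instruction needs about 500 MB/s.  Whether the write path
between those two cache levels keeps up is exactly the question.)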

>- So, *if* you have write-allocate and sub-block placement, write-misses
>on pure heap allocation come for free, and thus there is no win in this
>department in doing stack allocation.  However, stack allocation *does*
>tend to make a very very small improvement in read-miss rates, which can be
>important in some circumstances.  However, Andrew Appel's recent paper
>seems to indicate that this doesn't show up very often.

This seems bizarre to me (which is not to imply that I disbelieve it).
I'd think that writing tons of stuff into the cache would displace the
important longer-lived data structures, which would be read-missed.  In
effect, this will reduce the size of the cache because the flood of
short-lived stuff will make the cache residence times much shorter.
Interesting that this doesn't appear to be a large effect.  I'd expect
it to be workload-dependent, significant on average, and a serious
problem in some cases.  (This is just reasoning from the usual ideas
of what a cache does, how locality works, etc.  Maybe those ideas are
wrong, or are wrong for ML programs.)  It would seem to imply that a
much smaller cache will do about as well for the long-lived data, if
it doesn't have to deal with short-lived junk.  Maybe this is an artifact
of a nearly-functional style with no object identity---lots of creating
new objects instead of rereferencing old ones...?
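
(To make that displacement story concrete: below is a toy direct-mapped
cache simulation in C---purely illustrative, every size made up.  A
flood of sequential allocation writes sweeps the cache and evicts all
of the long-lived lines, so the next pass over the long-lived data
misses on every line.)

    #include <stdio.h>

    #define LINE   32        /* bytes per line (made up) */
    #define NLINES 2048      /* 2048 x 32 bytes = 64KB, direct-mapped */

    static unsigned long tags[NLINES];
    static int valid[NLINES];
    static long misses;

    /* One reference: direct-mapped lookup, counting misses. */
    static void touch(unsigned long addr)
    {
        unsigned long line = addr / LINE;
        unsigned long idx = line % NLINES;
        if (!valid[idx] || tags[idx] != line) {
            misses++;
            valid[idx] = 1;
            tags[idx] = line;
        }
    }

    int main(void)
    {
        unsigned long i;
        long before;
        unsigned long old_data = 0x100000;  /* 32KB long-lived structure */
        unsigned long heap     = 0x800000;  /* allocation region */

        /* Warm the cache with the long-lived data. */
        for (i = 0; i < 32*1024; i += 4)
            touch(old_data + i);

        /* Allocate (initializing writes) 256KB sequentially; this
           sweeps the whole cache several times over. */
        for (i = 0; i < 256*1024; i += 4)
            touch(heap + i);

        /* Re-read the long-lived data: every line was displaced. */
        before = misses;
        for (i = 0; i < 32*1024; i += 4)
            touch(old_data + i);
        printf("read misses on old data after sweep: %ld of %d lines\n",
               misses - before, 32*1024/LINE);
        return 0;
    }

On this toy model the re-read misses on all 1024 of its lines; whether
real programs see anything like that is what the measurements are
probing.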

>So the bottom line?  Well, there really isn't any.  If you're running on a
>machine with the right kind of cache (like a DECstation 5000), then you'll
>do fine by doing pure heap allocation.  Other machines, like SPARCs, are
>probably a very different story.  Perhaps one can conclude that, as far as
>memory-performance is concerned, using a stack never loses.  However, it
>won't necessarily win, either.  In those cases, it might be best to
>consider other performance factors, such as instruction counts.

That seems right.  One thing to keep in mind is that a stack can be pretty
cheap in instructions, too, if implemented carefully.  You shouldn't lose
much on any architecture, and you'll win big on most of them.

>Peter Lee
>Computer Science Department
>Carnegie Mellon University

-- 
| Paul R. Wilson,   Computer Sciences Dept.,   University of Texas at Austin  |
| Taylor Hall 2.124,  Austin, TX 78712-1188       ······@cs.utexas.edu        |
| (Recent papers on garbage collection, memory hierarchies, and persistence   |
| are available via anonymous ftp from cs.utexas.edu, in pub/garbage.)        |

From: Peter Lee
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <CK6wy7.Lny.3@cs.cmu.edu>
In article <··········@jive.cs.utexas.edu>,
Paul Wilson <······@cs.utexas.edu> wrote:
>>- So, *if* you have write-allocate and sub-block placement, write-misses
>>on pure heap allocation come for free, and thus there is no win in this
>>department in doing stack allocation.  However, stack allocation *does*
>>tend to make a very very small improvement in read-miss rates, which can be
>>important in some circumstances.  However, Andrew Appel's recent paper
>>seems to indicate that this doesn't show up very often.
>
>This seems bizarre to me (which is not to imply that I disbelieve it).
>I'd think that writing tons of stuff into the cache would displace the
>important longer-lived data structures, which would be read-missed.  In
>effect, this will reduce the size of the cache because the flood of
>short-lived stuff will make the cache residence times much shorter.
>Interesting that this doesn't appear to be a large effect.  I'd expect
>it to be workload-dependent, significant on average, and a serious
>problem in some cases.  (This is just reasoning from the usual ideas

I agree.  However, the benchmarks I've seen (which is not all that many,
perhaps less than 25 programs) don't exhibit this behavior.  As you say,
this may be an artifact of the SML programming style, which might tend
towards creation of new objects rather than multiple reads of existing
ones.  And perhaps the sheer volume of closure/activation records
allocated, whether on the stack or on the heap, is a factor.  It will be
interesting to get a close look at Appel's data when his paper appears.  At
any rate, I am almost certain that one could find programs for which
read-miss penalties were significant to the overall performance of the
program.

Peter Lee
Computer Science Department
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh  PA  15213-3891
From: Scott Nettles
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <CK749w.2Jr.3@cs.cmu.edu>
In article <············@cs.cmu.edu>, Peter Lee <······@cs.cmu.edu> wrote:
>In article <··········@jive.cs.utexas.edu>,
>Paul Wilson <······@cs.utexas.edu> wrote:
>>This seems bizarre to me (which is not to imply that I disbelieve it).
>I agree.  However, the benchmarks I've seen (which is not all that many,
>perhaps less than 25 programs) don't exhibit this behavior.  As you say,
>this may be an artifact of the SML programming style, which might tend
>towards creation of new objects rather than multiple reads of existing
>ones.  And perhaps the sheer volume of closure/activation records
>allocated, whether on the stack or on the heap, is a factor.  It will be

Mark Reinhold in his Ph.D. thesis gets very similar results for Scheme
programs using the T implementation, which is quite aggressive in its use of
a stack.  This suggests that it's SML-specific only to the extent that the
two languages share a coding style.  Offhand I'd say that both emphasize
creation over modification, so WRT the issues at hand they're pretty much
alike.  Mark assumes good write performance, arguing that such features are
useful to C and Fortran and thus will become more common.  His benchmarks
are more substantial than the SML ones I've seen.  In his thesis, Mark gives
some explanation/speculation about why one gets these results, which I won't
try to reproduce.  His thesis is available as an MIT tech report, but I
don't have the number.  It should be easy to find; it appeared sometime in
the fall of 1993.  It's very good work, and I highly recommend it to anyone
interested in these issues.

Scott
From: Mark Reinhold
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <MBR.94Jan25180018@sting4.nj.nec.com>
In article <············@cs.cmu.edu> Scott Nettles <········@cs.cmu.edu>
mentions my recent Ph.D. work, which addresses this topic.  The abstract of the
technical-report version of my dissertation is included below.

A cache write-miss policy of write-validate (a.k.a. write-allocate with
sub-block placement) is not as important to the Scheme programs that I studied
as it is for the SML programs studied by Diwan et al.  For R3000-class
machines, this optimization yields a significant improvement only for
relatively large caches (>= 256KB); for faster machines (say, with a 2ns
clock), this policy is required for good performance except when caches are
very large (>= 2MB).  This is probably explained by the differences between
Scheme and ML programming styles (the latter being a little more "pure"), and
by the fact that the Scheme compiler that I used (Orbit) does a good job of
stack-allocating closures, thereby reducing the allocation (i.e., write-miss)
rate.  This result is consistent with Lee's report that stack allocation
dramatically reduces the write-miss rate of SML programs.
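
(For readers without the papers at hand, here is a toy C sketch of the
two write-miss policies---not a model of any particular machine;
write-backs and read handling are omitted.  The point is that under
write-validate an initializing store that misses claims the line
without reading anything from memory:)

    #include <stdio.h>

    /* One cache line of 8 words, plus per-word valid bits
       (sub-block placement).  Sizes are made up. */
    typedef struct {
        unsigned long tag;
        int valid;
        unsigned char word_valid[8];
    } line_t;

    static long memory_fetches;  /* full-block reads from memory */

    /* Fetch-on-write: a write miss reads the whole block from
       memory before the store can complete. */
    static void store_fetch_on_write(line_t *l, unsigned long tag, int w)
    {
        if (!l->valid || l->tag != tag) {
            memory_fetches++;            /* the write-miss penalty */
            l->tag = tag; l->valid = 1;
        }
        l->word_valid[w] = 1;
    }

    /* Write-validate (write-allocate, one-word sub-blocks): a write
       miss just claims the line; nothing is read from memory. */
    static void store_write_validate(line_t *l, unsigned long tag, int w)
    {
        int i;
        if (!l->valid || l->tag != tag) {
            l->tag = tag; l->valid = 1;
            for (i = 0; i < 8; i++) l->word_valid[i] = 0;
        }
        l->word_valid[w] = 1;  /* only the written word becomes valid */
    }

    int main(void)
    {
        line_t line = {0};
        unsigned long tag;
        int w;

        /* Initialize 100 fresh 8-word objects, as an allocator would. */
        for (tag = 1; tag <= 100; tag++)
            for (w = 0; w < 8; w++)
                store_fetch_on_write(&line, tag, w);
        printf("fetch-on-write: %ld memory fetches\n", memory_fetches);

        memory_fetches = 0;
        for (tag = 1; tag <= 100; tag++)
            for (w = 0; w < 8; w++)
                store_write_validate(&line, tag, w);
        printf("write-validate: %ld memory fetches\n", memory_fetches);
        return 0;
    }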

A further advantage of stack allocation is that the reduction in the
heap-allocation rate implies a reduction in the rate at which the allocation
pointer sweeps through the cache, which in turn gives heap-allocated objects a
longer time to stay in the cache (from the time they are allocated) before
being flushed because their cache blocks are required for the next round of
allocation.  (Newly-allocated objects may be flushed by references to older
objects, of course, but this is rare.)  Given the short lifetimes of most
heap-allocated objects, the magnitude of this second advantage is probably
small.  But it might explain Lee's observation that, with write-validate
caches, "stack allocation *does* tend to make a very very small improvement in
read-miss rates".

--------

			     Cache Performance of
		    Garbage-Collected Programming Languages

			       Mark B. Reinhold

		      MIT Laboratory for Computer Science
			     Technical Report 581
				September 1993


As processor speeds continue to improve relative to main-memory access times,
cache performance is becoming an increasingly important component of program
performance.  Prior work on the cache performance of garbage-collected
programming languages has either assumed or argued that conventional
garbage-collection methods will yield poor performance, and has therefore
concentrated on new collection algorithms designed specifically to improve
cache-level reference locality.  This dissertation argues to the contrary: Many
programs written in garbage-collected languages are naturally well-suited to
the direct-mapped caches typically found in modern computer systems.

Using a trace-driven cache simulator and other analysis tools, five nontrivial,
long-running Scheme programs are studied.  A control experiment shows that the
programs have excellent cache performance without any garbage collection at
all.  A second experiment indicates that the programs will perform well with a
simple and infrequently-run generational compacting collector.

An analysis of the test programs' memory usage patterns reveals that the
mostly-functional programming style typically used in Scheme programs, in
combination with simple linear storage allocation, causes most data objects to
be dispersed in time and space so that references to them cause little cache
interference.  From this it follows that other Scheme programs, and programs
written in similar styles in different languages, should perform well with a
simple generational compacting collector; sophisticated collectors intended to
improve cache performance are unlikely to be effective.  The analysis also
suggests that, as locality becomes ever more important to program performance,
programs written in garbage-collected languages may turn out to have a
significant performance advantage over programs written in more conventional
languages.

--------

Copies of this report may be ordered from MIT LCS Publications, 545 Technology
Square, Cambridge, Massachusetts 02139; 617-253-5851; <····@lcs.mit.edu>.
This report is not available electronically.
From: Paul Wilson
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <2i7bma$dhq@jive.cs.utexas.edu>
In article <·················@sting4.nj.nec.com>,
Mark Reinhold <···@research.nj.nec.com> wrote:
>
>A cache write-miss policy of write-validate (a.k.a. write-allocate with
>sub-block placement) is not as important to the Scheme programs that I studied
>as it is for the SML programs studied by Diwan et al.  For R3000-class
>machines, this optimization yields a significant improvement only for
>relatively large caches (>= 256KB);

This sounds peculiar.  I would think that the write-validate thing would
be most effective for caches smaller than the youngest generation.  Did
you have a generational GC?  What size youngest generation were these
results for?

> for faster machines (say, with a 2ns
>clock), this policy is required for good performance except when caches are
>very large (>= 2MB).

This sounds right, but I'd think you wouldn't need 2MB with a generational
GC and a youngest generation of a megabyte or whatever.

>This is probably explained by the differences between
>Scheme and ML programming styles (the latter being a little more "pure"), and
>by the fact that the Scheme compiler that I used (Orbit) does a good job of
>stack-allocating closures, thereby reducing the allocation (i.e., write-miss)
>rate.  This result is consistent with Lee's report that stack allocation
>dramatically reduces the write-miss rate of SML programs.

And it's consistent with our report that we got huge reductions in heap
allocation with a very simple compiler hack and stack cache.  Unfortunately,
we thought that
this would be so obviously related to the cache miss rate that we didn't
bother to trace the system without the stack cache and report those
results---we just assumed it was obvious enough that if you get a lot of
misses with one, life will be considerably worse without it.

>A further advantage of stack allocation is that the reduction in the
>heap-allocation rate implies a reduction in the rate at which the allocation
>pointer sweeps through the cache, which in turn gives heap-allocated objects a
>longer time to stay in the cache (from the time they are allocated) before
>being flushed because their cache blocks are required for the next round of
>allocation.  (Newly-allocated objects may be flushed by references to older
>objects, of course, but this is rare.)  Given the short lifetimes of most
>heap-allocated objects, the magnitude of this second advantage is probably
>small.  But it might explain Lee's observation that, with write-validate
>caches, "stack allocation *does* tend to make a very very small improvement in
>read-miss rates".

It's still interesting that these programs don't seem to be building
big data structures that they traverse a lot.  I wonder if an object-oriented
style would be different---updating more data structures in place rather
than applicatively.

Trashing your cache continually *ought* to cost something significant,
or you don't get much benefit from increased cache size.  But maybe the
benefit of a larger cache (for these programs) is only in keeping more
and more of the fairly-recently-allocated stuff around, and maybe that's
where all the action is.

>			     Cache Performance of
>		    Garbage-Collected Programming Languages
>
>As processor speeds continue to improve relative to main-memory access times,
>cache performance is becoming an increasingly important component of program
>performance.  Prior work on the cache performance of garbage-collected
>programming languages has either assumed or argued that conventional
>garbage-collection methods will yield poor performance, and has therefore
>concentrated on new collection algorithms designed specifically to improve
>cache-level reference locality.  This dissertation argues to the contrary: Many
>programs written in garbage-collected languages are naturally well-suited to
>the direct-mapped caches typically found in modern computer systems.

  I assume you mean me when you talk about people arguing that conventional
garbage collection methods will yield poor cache performance.  In my one
refereed publication on the subject, I said that the problem was NOT large
for (then-)current systems, but would likely be so in faster future systems.
I still hold to that view for the most part.  If you don't have a big cache
(relative to the youngest generation), you'll have trouble on a fast machine
with most common cache configurations.  It turns out you'll do okay if
you have write-validate (with the line size the same as the initializing
write size), but that's essentially what I said, too---you'll get better
performance with an optimization that allows you to allocate in the cache.
I simply hadn't realized that the write-validate trick will do that for you
on some machines now.

>Using a trace-driven cache simulator and other analysis tools, five nontrivial,
>long-running Scheme programs are studied.  A control experiment shows that the
>programs have excellent cache performance without any garbage collection at
>all.

Is this assuming write-validate and/or a machine that's not fast (like, not
next year's Alpha)?  If not, it's particularly interesting.  It would seem
to suggest that programs mostly reference stuff in the last cache-full of
allocation, and/or longer-lived stuff that's so hot it's unlikely to get
bumped out. 

>A second experiment indicates that the programs will perform well with a
>simple and infrequently-run generational compacting collector.
>
>An analysis of the test programs' memory usage patterns reveals that the
>mostly-functional programming style typically used in Scheme programs, in
>combination with simple linear storage allocation, causes most data objects to
>be dispersed in time and space so that references to them cause little cache
>interference.

Are you sure this is what you're seeing?  Or is the spatial locality simply
very good, so that you almost never touch more than a few things that were
allocated more than a cache-full of allocation ago?  As long as you reference
things within a contiguous area smaller than the cache, you'll get NO conflicts
at all.  (In this case, the area you linearly allocated the last cache-full
of stuff in.)  That's even better than effectively random placement, and I
wouldn't be surprised if it's an important effect.
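
(The arithmetic behind that: in a direct-mapped cache of C bytes, an
address maps to line index (addr / linesize) mod (C / linesize), so two
addresses less than C apart either share a line or map to different
lines---they cannot conflict.  With a 64KB cache, the most recent 64KB
of linearly allocated objects can never evict each other, only older
stuff.)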

>  From this it follows that other Scheme programs, and programs
>written in similar styles in different languages, should perform well with a
>simple generational compacting collector; sophisticated collectors intended to
>improve cache performance are unlikely to be effective.  The analysis also
>suggests that, as locality becomes ever more important to program performance,
>programs written in garbage-collected languages may turn out to have a
>significant performance advantage over programs written in more conventional
>languages.

That would be very, very nice, if true.  And it sounds like other people
are seeing similar stuff, so I'm hopeful.

I think you want to be careful about how far you generalize this.  We
haven't seen any results for GC'd languages with more conventional
programming styles and less heap allocation (e.g., Eiffel, Modula-3 and
Oberon).  They may have more of an emphasis on referencing older data
structures, as opposed to continually creating new ones.  (I have no
idea if that's really true.  Their older structures may have a heavily
skewed reference distribution, and you may be OK anyhow.)

-- 
| Paul R. Wilson,   Computer Sciences Dept.,   University of Texas at Austin  |
| Taylor Hall 2.124,  Austin, TX 78712-1188       ······@cs.utexas.edu        |
| (Recent papers on garbage collection, memory hierarchies, and persistence   |
| are available via anonymous ftp from cs.utexas.edu, in pub/garbage.)        |
From: Fernando Mato Mira
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <2i8eop$8ov@disuns2.epfl.ch>
In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul Wilson) writes:

> big data structures that they traverse a lot.  I wonder if an object-oriented
> style would be different---updating more data structures in place rather
> than applicatively.

I'm getting very worried about the write barrier problem. Wouldn't it make
sense to reorganize older spaces into `dynamic space' and `static space', with
objects moving from one to the other as the frequency of their update with
pointers to newer objects changes?

-- 
F.D. Mato Mira                           
Computer Graphics Lab    ········@epfl.ch
EPFL                     FAX: +41 (21) 693-5328
From: Stefan Monnier
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <2i8i86$kk1@disuns2.epfl.ch>
In article <··········@disuns2.epfl.ch>,
Fernando Mato Mira <········@di.epfl.ch> wrote:
> I'm getting very worried about the write barrier problem. Wouldn't it make
> sense to reorganize older spaces into `dynamic space' and `static space' with
> objects moving from one to the other as the frequency of their update with
> pointers to newer objects change?

But then, you'd have to check at each pointer write if the pointer
points to dynamic or static space, which may take just as much time as
a card mark.
(on the other hand, a VM-based write barrier would probably make wise
use of such a separation (as was done for the TI Explorer I think))
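
(For concreteness, a card-marking barrier is just a shift and a byte
store per pointer write---a rough sketch in C, sizes made up, with the
GC-time scanning of marked cards left out:)

    #include <stdio.h>
    #include <stdlib.h>

    #define HEAP_SIZE (16u << 20)  /* 16MB heap (made up) */
    #define CARD_BITS 9            /* 512-byte cards */

    static unsigned char card_table[HEAP_SIZE >> CARD_BITS];
    static unsigned char *heap_base;

    /* The barrier: do the store, then unconditionally mark the card
       holding the updated slot.  A shift and a byte store---no check
       of whether the slot is in `dynamic' or `static' space. */
    static void write_ptr(void **slot, void *value)
    {
        *slot = value;
        card_table[((unsigned char *)slot - heap_base) >> CARD_BITS] = 1;
    }

    int main(void)
    {
        void **old_slot;
        heap_base = malloc(HEAP_SIZE);
        if (!heap_base) return 1;

        old_slot = (void **)(heap_base + 4096);           /* pretend: old object */
        write_ptr(old_slot, heap_base + HEAP_SIZE - 64);  /* young target */

        /* At scavenge time the collector scans only marked cards for
           old->young pointers, then clears the marks. */
        printf("card %d marked\n", 4096 >> CARD_BITS);
        return 0;
    }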

	Stefan
-- 

-----------------------------------------------------
-- On the average, people seem to be acting normal --
-----------------------------------------------------
From: Amer Diwan
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <DIWAN.94Jan27150614@ibis.cs.umass.edu>
>>>>> On 26 Jan 1994 21:16:26 -0600, ······@cs.utexas.edu (Paul Wilson) said:
Paul> Mark Reinhold <···@research.nj.nec.com> wrote:
>A cache write-miss policy of write-validate (a.k.a. write-allocate with
>sub-block placement) is not as important to the Scheme programs that I studied
>as it is for the SML programs studied by Diwan et al.  For R3000-class
>machines, this optimization yields a significant improvement only for
>relatively large caches (>= 256KB);

Paul> This sounds peculiar.  I would think that the write-validate thing would
Paul> be most effective for caches smaller than the youngest generation.  

I have been looking at Mark's thesis and the answer to your question is on
page 17.  He gives a graph plotting the fraction of all fetches that are
allocation fetches (i.e., fetches when initializing newly allocated data) in a
cache without subblock placement.  The _fraction_ does increase for larger
caches: i.e., write validate (aka subblock placement) removes a greater
fraction of the memory fetches for larger caches, but that doesn't mean
that it removes _more_ fetches.  It just means that the non-allocation
fetches drop off more quickly as the cache size increases, thus increasing
the fraction.
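
(Made-up numbers to illustrate the distinction: suppose a program does
100 allocation fetches regardless of cache size.  If a small cache also
causes 900 non-allocation fetches, allocation fetches are 10% of the
total; if a large cache cuts the non-allocation fetches to 100, they
become 50% of the total.  The fraction grew, but write validate removes
exactly the same 100 fetches either way.)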

Paul's comment about write-validate being most effective for caches smaller
than the youngest generation is true for our experiments with SML/NJ.  For
small caches (usually less than 512K), our programs exhibited much better
behavior with write validate than with other write-miss policies.  For caches
of 512K or more, write validate offered little or no improvement in many of
the programs.

    Amer
--

------------------------------------------------------------------------------
Amer Diwan                                    Department of Computer Science
Object Systems Lab                            University of Massachusetts   
·····@cs.umass.edu                            Amherst, MA 01003
(413) 545-0256
From: David Tarditi
Subject: Re: Memory subsystems and GC (was Re: Help with Continuations)
Date: 
Message-ID: <CK7BBv.6Hv.3@cs.cmu.edu>
The POPL94 paper mentioned earlier in this thread is ``Memory Subsystem
Performance of Programs with Copying Garbage Collection,'' 21st ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Portland,
Oregon, Jan. 17-19, 1994, pp. 1-13.

It is available by anonymous ftp from reports.adm.cs.cmu.edu
in 1993/CMU-CS-93-210.ps.

An extended version of this paper, titled ``Memory Subsystem
Performance of Programs with Intensive Heap Allocation,'' is
available as CMU CS technical report CMU-CS-93-227.

It is available by anonymous ftp from reports.adm.cs.cmu.edu
in 1993/CMU-CS-93-227.ps.
  
The respective abstracts are below.
----
Heap allocation with copying garbage collection is believed to have
poor memory subsystem performance.  We conducted a study of the memory
subsystem performance of heap allocation for memory subsystems
found on many machines.  We found that many machines support heap
allocation poorly.  However, with the appropriate memory subsystem
organization, heap allocation can have good memory subsystem performance.
-----
Heap allocation with copying garbage collection is a general storage
management technique for modern programming languages. 
It is believed to have poor memory subsystem performance.
To investigate this, we conducted an  in-depth study of the memory
subsystem performance of heap allocation for memory subsystems
found on many machines.   We studied the performance of mostly-functional
Standard ML programs which made heavy use of heap allocation.
We found that most machines support heap
allocation poorly.  However, with the appropriate memory subsystem
organization, heap allocation can have good performance.  The memory
subsystem property crucial for achieving good performance was
the ability to allocate and initialize a new object into the
cache without a penalty.  This can be achieved by having
subblock placement with a subblock size of one word and a write-allocate
policy, along with fast page-mode writes or a write buffer.
For caches with subblock placement, the data cache overhead was
under 9% for a 64K or larger data cache; without subblock placement
the overhead was often higher than 50%.
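
(For readers who want the shape of the access pattern being measured:
heap allocation in these systems is a pointer bump followed immediately
by initializing writes, so every word of a new object is written before
it is ever read, and the allocation pointer moves linearly through
memory.  A rough C sketch of such an allocation fast path---names made
up, the collector itself omitted:)

    #include <stdio.h>
    #include <stdlib.h>

    static char *alloc_ptr;    /* next free byte */
    static char *alloc_limit;  /* end of the allocation region */

    /* The fast path: a bounds check and a pointer bump. */
    static void *alloc(size_t bytes)
    {
        void *obj;
        if (alloc_ptr + bytes > alloc_limit) {
            fprintf(stderr, "region full; a real system would GC here\n");
            exit(1);
        }
        obj = alloc_ptr;
        alloc_ptr += bytes;
        return obj;
    }

    typedef struct pair { struct pair *car, *cdr; } pair;

    static pair *cons(pair *car, pair *cdr)
    {
        pair *p = alloc(sizeof *p);
        p->car = car;  /* initializing writes: the cache's first */
        p->cdr = cdr;  /* sight of these addresses */
        return p;
    }

    int main(void)
    {
        pair *list = NULL;
        int i;

        alloc_ptr = malloc(1 << 20);
        if (!alloc_ptr) return 1;
        alloc_limit = alloc_ptr + (1 << 20);

        for (i = 0; i < 1000; i++)
            list = cons(NULL, list);
        printf("built a 1000-pair list ending at %p\n", (void *)list);
        return 0;
    }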