From: Chris Dollin
Subject: avoiding the retention of trash
Date: 
Message-ID: <KERS.91Jun13120231@cdollin.hpl.hp.com>
I'm implementing a Pop-like language (for the purposes of this discussion, we
can treat Pop as a Lisp-like language), and have hit a worrying possible
problem with garbage collection; or, more exactly, with garbage
*non*-collection. 

It goes like this. On entry to a procedure, a frame is allocated from the
current stack chunk, with enough room to hold all the local variables of the
procedure. (It doesn't make any essential difference if some locals are in
registers; as it happens, the current implementation is a byte-coded virtual
machine, so it's not an issue.) 

Suppose a garbage collection occurs before all the stack slots have been filled
(for example, in an initialisation expression). I can ensure that all the
as-yet unfilled slots have ``sensible'' values (ie do not hold random garbage
that can crash the collector or corrupt store), *but* they may be holding on to
store which was only referred to by a previous use of that stack slot. For
example:

    define procedure make_trash() as
        val pad = 42;
        val junk = { repeat 100000 times false endrepeat }; ;;; make big vector
    enddefine;

    define procedure caught_out() as
        val x = expression_invoking_gc();
        val y = 42;
    enddefine;
 
    define procedure main() as
        make_trash();
        caught_out()
    enddefine;

When caught_out runs, the y slot will be holding on to the big vector allocated
by make_trash.

Now, for procedures running toward the tip of the call tree this may not be too
serious; the locals will probably be overwritten soon, and the next GC will
reclaim the store. But procedures deeper in the stack may *also* be holding on
to dead store - and they won't let go for quite a while.
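The retention can be made concrete with a toy model (Python; the frame layout and the mark phase are illustrative stand-ins, not the actual implementation): a frame's slot array is reused without being cleared, and a collector that scans every slot still sees the dead vector.

```python
# Toy model of trash retention: one reused frame whose slots are scanned
# blindly by the collector.  Names and layout are illustrative only.

class Frame:
    def __init__(self, size):
        self.slots = [None] * size

def mark_frame_roots(frame):
    """Naive root scan: every non-empty slot is treated as live."""
    return {id(v) for v in frame.slots if v is not None}

frame = Frame(2)                     # the frame in the current stack chunk

# make_trash() runs: pad in slot 0, the big vector in slot 1.
big_vector = [False] * 100000
frame.slots[0] = 42
frame.slots[1] = big_vector

# make_trash() returns; the frame store is reused WITHOUT being cleared.
# caught_out() has stored x (slot 0) when a GC strikes, before y (slot 1)
# is filled -- so slot 1 still refers to the dead vector.
frame.slots[0] = "value of expression_invoking_gc()"
roots = mark_frame_roots(frame)
assert id(big_vector) in roots       # the dead vector is retained
```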

So, question 1: how much of a problem is this?

Question 2: if it is a problem, how do I solve it? I can think of several
possible solutions:

[a] On procedure entry, zap all local slots with safe values (like numeric 0,
or nil, or false). Cost: every entry is slowed down by the zapping.
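A minimal sketch of [a] (Python; the flat list-of-slots stack and the names are assumptions for illustration): the prologue clears every slot of the new frame, so nothing stale survives into the GC's view, at a per-entry cost linear in the number of locals.

```python
# Sketch of option [a]: zap all local slots with a safe value on entry.
# The flat list here is an illustrative stand-in for chunked stack store.

def enter_procedure(stack, frame_base, n_locals, safe_value=None):
    """Reserve n_locals slots at frame_base and zap them.
    Cost: O(n_locals) extra stores on EVERY procedure entry."""
    top = frame_base + n_locals
    while len(stack) < top:             # grow the chunk if needed
        stack.append(safe_value)
    for i in range(frame_base, top):    # the zapping option [a] pays for
        stack[i] = safe_value
    return top

stack = ["stale-vector", "stale-pad"]   # leftovers from an earlier call
enter_procedure(stack, 0, 2)
assert stack == [None, None]            # a GC here sees no old store
```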

[b] Don't allocate locals all at once: allocate each one only when it is
filled. Cost: allocation is no longer a single Store Local, and loops with
local declarations pay the cost each time round the body.

[c] Shortly after a local becomes unused (ie, there are no further references
to it), zap it (unnecessary if the compiler can prove it's not a heap value).
In the limit, zap them all at exit. Cost: much like [a], but possibly easier to
optimise. Loops are still a problem.

[d] Record the live ranges of each stack slot and bind into the procedure a
mapping from code position to active ranges. The GC then knows where the *real*
top-of-stack is, and avoids scanning unused slots. Cost: more work for the GC
scanning frames, more work for the compiler in setting up the information, and
store is required (presumably in the procedure header) to store the
description.
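A sketch of [d] (Python; the table format, position -> live slot indices, is an assumption for illustration): the GC consults the liveness map for the frame's current code position and scans only the slots that are really live, so uninitialised or dead slots are never treated as roots.

```python
# Sketch of option [d]: a per-procedure table mapping code positions to
# the live frame slots; the GC scans only those.  Layout is illustrative.

LIVE_MAP = {
    0: [],        # on entry, before any initialiser: nothing live
    1: [0],       # after x is stored
    2: [0, 1],    # after y is stored
}

def frame_roots(slots, code_position, live_map):
    """Return only the slot values the GC should treat as roots."""
    return [slots[i] for i in live_map[code_position]]

slots = ["stale-vector", "stale-pad"]           # uninitialised frame store
assert frame_roots(slots, 0, LIVE_MAP) == []    # stale values ignored
slots[0] = "x-value"                            # x filled; GC at position 1
assert frame_roots(slots, 1, LIVE_MAP) == ["x-value"]
```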

Any advice, references, assurances? I'm particularly interested in solutions
that apply when running native code, not interpreted byte-codes, as I plan to
move to a native-code compiler sometime during the next year.
--

Regards, Kers | "Once   I would have been glad   to have made your acquaintance
Renaissance:  | Now   I feel the danger   brought about   by circumstance."

From: Chris Dollin
Subject: Re: avoiding the retention of trash
Date: 
Message-ID: <KERS.91Jun13120751@cdollin.hpl.hp.com>
I forgot to add another approach:

[e] Don't re-use stack frames: allocate fresh ones from the heap. Cost: more GC
(because of the stack-frame turnover), may need fancier GC to cope, and must
take care about not-yet-initialised variables anyway. (I can't rely on virtual
memory hardware to help; not all my targets *have* virtual memory, or even
multiple address spaces.)
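A toy sketch of this heap-frame approach (Python; the accounting is illustrative only): fresh frames remove the stale-slot problem by construction, but every call becomes a heap allocation the collector must later reclaim.

```python
# Sketch of heap-allocated frames: each call gets fresh, cleared store,
# at the cost of one heap object per call.  Toy accounting only.

class HeapFrame:
    def __init__(self, n_locals):
        self.slots = [None] * n_locals   # fresh store: nothing stale

heap = []                                # stand-in for the allocator

def call(n_locals):
    frame = HeapFrame(n_locals)
    heap.append(frame)                   # GC pressure: one frame per call
    return frame

for _ in range(1000):
    call(2)

assert len(heap) == 1000                           # 1000 dead frames to collect
assert all(f.slots == [None, None] for f in heap)  # but never a stale slot
```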
--

Regards, Kers | "Once   I would have been glad   to have made your acquaintance
Renaissance:  | Now   I feel the danger   brought about   by circumstance."
From: Paul Wilson
Subject: Use a stack (was Re: avoiding the retention of trash)
Date: 
Message-ID: <1536@yoakum.cs.utexas.edu>
In article <··················@cdollin.hpl.hp.com> ····@hplb.hpl.hp.com (Chris Dollin) writes:
>
>I forgot to add another approach:
>
>[e] Don't re-use stack frames: allocate fresh ones from the heap. Cost: more GC
>(because of the stack-frame turnover), may need fancier GC to cope, and must
>take care about not-yet-initialised variables anyway. (I can't rely on virtual
>memory hardware to help; not all my targets *have* virtual memory, or even
>multiple address spaces.)

My advice is to use a stack.  You'd be amazed how much extraneous junk you
get on the heap if you don't have a stack.  For some programs I ran, using
a stack cache reduced heap allocation in a Scheme system by MORE THAN AN
ORDER OF MAGNITUDE on average.  It might be different in a system that
did more inlining and therefore less procedure-calling, but I think you'll
still get several times as much stuff on the heap.

With a simple gc, this is going to increase your space or time costs by
a large factor, and gc costs are likely to be the dominant cost for many
programs.

With a generational GC, you still don't want to heap-allocate activation
records -- you want *both* a stack (or stack cache) and a generational gc.
(If you *must* get by with only a gc, make sure it has three or more
generations -- the youngest one will be used mostly for killing those
pesky activation records.)

One of the side-effects of rampant heap allocation is to make cache memories
much less effective -- all of the short-lived data romp through the cache,
causing misses and displacing things that will be missed again later.  Unless you
have a cache big enough to hold your youngest generation, you get cache miss
traffic equal to twice the rate of allocation.

This effect is particularly bad in a direct-mapped cache, because it
undermines some of the locality assumptions that direct-mapped caches are
based on.  You not only get (capacity) miss traffic greater than the rate of
allocation, you get even more misses due to mapping conflicts.  Very bad.
In my simulations I see miss rates for direct-mapped caches anywhere from
about 50% higher (for small caches) to 200-600% higher (for large ones that
can hold the youngest generation in cache).

(These are for unified I/D caches.  Split caches help with small cache
sizes -- which is good -- but not so much for large ones.)

If you're interested in this sort of thing, you might want to check out
my paper "Heap Management and Hierarchical Memories," in SIGPLAN Notices
a few months back (Feb? March?).  I also have a tech report about GC'd
systems' cache performance.

  -- PRW

-- 
| Paul R. Wilson, Interactive Computing Envts. Lab. lab ph.: (312) 996-9216   |
| U. of Illinois at Chicago EECS Dept. (M/C 154)    ······@bert.eecs.uic.edu* |
| P.O. Box 4348   Chicago,IL 60680                  fax ph.: (312) 413-0024   |
| *Next yr, Asst Prof, UT Austin CS Dept-- after 8/1 use ······@cs.utexas.edu |
From: Barry Margolin
Subject: Re: avoiding the retention of trash
Date: 
Message-ID: <1991Jun13.210829.8799@Think.COM>
In article <··················@cdollin.hpl.hp.com> ····@hplb.hpl.hp.com (Chris Dollin) writes:
>[c] Shortly after a local becomes unused (ie, there are no further references
>to it), zap it (unnecessary if the compiler can prove it's not a heap value).
>In the limit, zap them all at exit. Cost: much like [a], but possibly easier to
>optimise. Loops are still a problem.

This is a bit more work, but seems like the best solution to me.  It solves
not only this particular problem, but also the problem of retained garbage
in active procedures.  For instance, consider the following:

(let* ((big-temp (compute-big-thing))
       (little-temp (f big-temp)))
  ;; body only uses little-temp
  )

Once you put this type of variable lifetime analysis into the compiler, it
can be used for other things as well.  For instance, in the above code, the
same stack location could actually be used for both big-temp and
little-temp, because their lifetimes don't overlap.  Big-temp's value would
be zapped automatically when you store into little-temp (a peephole
optimizer could remove the redundant store of nil, if necessary).
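That slot-sharing idea can be sketched as follows (Python; the interval representation and the greedy assignment are assumptions for illustration): once live ranges are known, variables whose ranges don't overlap map to the same slot, so storing little-temp overwrites, and thereby zaps, big-temp.

```python
# Sketch of lifetime-based slot sharing: two variables share a stack slot
# if their live ranges (start, end) do not overlap.  Illustrative only.

def assign_slots(live_ranges):
    """Greedy assignment: reuse a slot whose last occupant is dead."""
    slot_ends = []   # per slot: end position of its current occupant
    assignment = {}
    for name, (start, end) in sorted(live_ranges.items(),
                                     key=lambda kv: kv[1]):
        for i, slot_end in enumerate(slot_ends):
            if slot_end <= start:          # previous occupant is dead
                slot_ends[i] = end         # reuse the slot; old value zapped
                assignment[name] = i
                break
        else:
            slot_ends.append(end)          # no dead slot: allocate a new one
            assignment[name] = len(slot_ends) - 1
    return assignment

# big-temp is live only until little-temp is computed from it.
ranges = {"big-temp": (0, 1), "little-temp": (1, 5)}
assignment = assign_slots(ranges)
assert assignment["big-temp"] == assignment["little-temp"]  # one shared slot
```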
-- 
Barry Margolin, Thinking Machines Corp.

······@think.com
{uunet,harvard}!think!barmar