From: Paul Wilson
Subject: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3v8g7l$cge@jive.cs.utexas.edu>
In article <··················@aruba.apple.com>,
Ken Dickey <····@apple.com> wrote:
>At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
>...
>
>>Another thing to note is that compacting GC's usually dramatically
>>reduce the overall working set of programs, which in turn reduces the
>>number of page faults a program might take.  ...

I don't think this is quite right.  GC usually costs you a little bit of 
locality, relative to a well-designed explicit allocator.  There's very
little data on locality of allocators, however.  (Zorn's done some basic
stuff, but I don't regard it as conclusive.)  We hope to fix that sometime
soon.

GC costs space, because you can't reuse memory immediately--you have
to wait until the GC notices it's garbage.  All other things being
equal, this is bad news.  Generational GC helps, but does not eliminate
the problem, especially at the level of high-speed caches (as opposed
to virtual memory).  If the youngest generation is larger than the
cache, the number of cache misses can go WAY up.  (But this may or
may not cost you a lot of stalls.)

A poorly-designed GC (or malloc) can be a locality disaster, because
it can scramble related data.  A naive copying collector may do this
when it hits a hash table---it reaches the structures referenced from
the hash table in pseudo-random order, and tends to interleave things
in memory as it traverses and copies them.  A conventional next fit
allocator (i.e., first fit with a roving pointer) may do something
similar.  Next fit is bad news.  (See our PLDI '91 paper for demonstrations
of the effects of copying data in randomized order, and our allocator
survey paper for a discussion of next fit.  The latter is on our ftp
site.)
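
For concreteness, next fit looks roughly like this (just a sketch---splitting,
coalescing, and headers are left out, and real implementations differ in the
details):

    #include <stddef.h>

    /* Next fit: first fit with a roving pointer.  The search resumes
       wherever the last one stopped, so successive allocations tend to
       march through the free list (and through memory), which is part
       of what hurts locality. */

    struct free_block {
        size_t size;
        struct free_block *next;        /* circularly linked free list */
    };

    static struct free_block *rover;    /* where the last search stopped */

    void *next_fit_alloc(size_t size)
    {
        struct free_block *start = rover;

        if (rover == NULL)
            return NULL;                /* empty free list */
        do {
            if (rover->size >= size) {
                struct free_block *found = rover;
                rover = rover->next;    /* the roving pointer moves on */
                /* a real allocator would unlink and split 'found' here */
                return (void *)(found + 1);
            }
            rover = rover->next;
        } while (rover != start);
        return NULL;                    /* nothing big enough */
    }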

>>So even though time-to-market considerations are important (and I'll
>>bet a GC wins this race every time), I now believe that even your
>>basic overall performance wins in modern GC-based systems.
>
>Which brings to mind another paper: "Cache Performance of Garbage Collected
>Programs" by Mark Reinhold, L&FP '94.  Mark presents evidence that even a
>stop & copy collector exhibits good cache performance when amortized over a
>program's execution lifetime and argues that complex compaction schemes
>are not required.  Mark also conjectures that a gc'ed, mostly functional
>programming style is significantly more efficient than imperative-style,
>non-gc'ed programs in a cached VM system ("allocation is faster than
>mutation") and the performance gains will be more significant as CPU's
>continue to get faster relative to memory systems.

I'm quite skeptical of this.  When you romp through lots of memory on
a regular basis (e.g., with a normal tracing GC, during allocation *between*
GC's), you miss your cache like crazy.  On some architectures, this is
not in itself a big deal, because the cache effectively optimizes away
the fetching of garbage blocks from memory, and allocates blocks directly
in the cache.  (With good write buffers and high cache-to-memory bandwidth,
the evictions may not be a big deal either.)  On other architectures, it is
a big lose because every time you allocate another block of memory, you fault
in a block of garbage just before you overwrite it during allocation.  This
is very dependent on the details of the cache architecture, and most current
processors (including most high-end processors) do NOT have the GC-friendly
features.
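
To see why, consider the usual fast path for allocation in a copying-collected
system.  It's roughly the following (a sketch, not any particular collector;
the sizes and names are made up):

    #include <stddef.h>

    /* Bump-pointer allocation in the youngest generation.  Between
       collections the allocator only moves forward, so every allocation
       touches memory that hasn't been touched since the last GC; on a
       cache without the "allocate directly in the cache" behavior, each
       newly touched block is a miss. */

    #define NURSERY_SIZE (512 * 1024)

    static char nursery[NURSERY_SIZE];
    static char *alloc_ptr = nursery;

    static void collect_nursery(void)
    {
        /* a real collector would copy the survivors out first */
        alloc_ptr = nursery;            /* only now does memory get reused */
    }

    void *gc_alloc(size_t size)
    {
        size = (size + 7) & ~(size_t)7; /* keep doubleword alignment */
        if (alloc_ptr + size > nursery + NURSERY_SIZE)
            collect_nursery();
        void *obj = alloc_ptr;
        alloc_ptr += size;              /* march into untouched memory */
        return obj;
    }

Nothing gets reused until collect_nursery() runs; an explicit allocator could
hand the same still-cached lines right back on the next malloc().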

I'm especially skeptical of the scalability argument.  As processor speeds
grow relative to memory *bandwidth*, the write-backs caused by profligate
heap allocation will become a bottleneck.  Whenever you allocate a block,
you have to evict something, and in a GC'd system, the block you evict
is almost always dirty (because it was written to when you allocated something
there a little while ago).  Prefetching and preflushing don't help with
this once you're bandwidth-limited.  Reusing memory while it's still in 
the cache is a better and better idea as processors get faster.

(Appel's new (FPCA '95) paper deals with the issue of interactions between
cache design and GC design.  It pretty much vindicates my much-attacked
LFP '92 paper.  As it turns out, everybody seems to be right---I'm right
for most architectures, but the "GC has good cache locality" camp (e.g.,
Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
you have a NICE cache.)

The bottom line, in my view, is that if you're targeting a range of
machines and can't pick your cache design, you should be very careful
about locality of reference during allocation.  Allocate on the stack
when you can do so safely, not on the heap---especially for very fast
processors, where cache stalls start to be really important.
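
In C terms, the difference is just this kind of thing (a trivial sketch, with
the sizes and the helper made up):

    #include <stdlib.h>

    static void use(double *v, int n)   /* stand-in for real work */
    {
        for (int i = 0; i < n; i++)
            v[i] = i * 2.0;
    }

    void stack_version(void)
    {
        double scratch[64];             /* dies automatically, stays cache-hot */
        use(scratch, 64);
    }

    void heap_version(void)
    {
        double *scratch = malloc(64 * sizeof *scratch);
        if (scratch == NULL)
            return;
        use(scratch, 64);
        free(scratch);                  /* or wait for a GC to notice it's dead */
    }

The stack version reuses the same few cache lines on every call; the heap
version keeps touching whatever block the allocator hands back, which in a
GC'd system won't be reused until a collection.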

For more info on these things, see the papers on our ftp site.

  Paul
-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      

From: Stefan Monnier
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3vac07$ptf@info.epfl.ch>
[comp.arch added since the thread is drifting towards cache design]

In article <··········@jive.cs.utexas.edu>,
Paul Wilson <······@cs.utexas.edu> wrote:
] In article <··················@aruba.apple.com>,
] Ken Dickey <····@apple.com> wrote:
] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
] >>Another thing to note is that compacting GC's usually dramatically
] >>reduce the overall working set of programs, which in turn reduces the
] >>number of page faults a program might take.  ...

Is there hard data on this one?
The only case where the GC was really able to reduce the working set size (at
the VM level) is for the TI Explorer. I don't have the reference at hand, but
they compared (on that TI machine) a plain generational copying collector to the
fancy incremental/generational/locality-improving GC for a CAD routing problem.
In most cases the fancy GC cost a little performance, except when
the machine had little memory, in which case the paging behavior was
noticeably better for the fancy GC (it took about a day to finish, whereas they
gave up on the test with the plain GC after a week). Strangely, their conclusion
was that the plain GC was superior, since it's a lot simpler and in most cases
slightly faster. Of course, the incremental aspect of the GC wasn't
seen as an advantage for a big CAD routing problem (this does call for
parameterizable GC behavior, right?).

] I don't think this is quite right.  GC usually costs you a little bit of 
] locality, relative to a well-designed explicit allocator.  There's very

I guess it depends on what you call locality. If you only consider the concept
(far from hardware problems), I can't see why: the set of objects used is the
same and I can't see why the layout of objects has to be worse for a GC than for
a manual allocator. Of course, actual hardware generally behaves worse with a GC
because it doesn't quite understand the way a GC works. The better the cache
can "understand" the GC, the better the locality.

] little data on locality of allocators, however.  (Zorn's done some basic
] stuff, but I don't regard it as conclusive.)  We hope to fix that sometime
] soon.

I do hope so too, but it's not gonna be easy: there are so many variants of GC
and so many variants of memory hierarchies...

] (Appel's new (FPCA '95) paper deals with the issue of interactions between
] cache design and GC design.  It pretty much vindicates my much-attacked
] LFP '92 paper.  As it turns out, everybody seems to be right---I'm right
] for most architectures, but the "GC has good cache locality" camp (e.g.,
] Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
] you have a NICE cache.)

This is the "understanding" problem: if the cache can notice that reading the
line is not necessary when allocating a new object, it means the cache
understands better the way the GC-allocator works. Your point about the memory
bandwidth of the write-backs could be argued against by saying that there should
be a way for the cache to notice that most of the write-backs aren't necessary
(I'm assuming here that the generations' sizes have been carefully chosen so that
they "fit the memory hierarchy" (whatever this really means in terms of relative
sizes): in such a case most write-backs will be for "old-space" objects that
have been recognized as dead already). Since there will still be write-backs of
objects that *are* dead but haven't been recognized as such yet, I guess manual
allocators still have an edge. But it might be balanced by the compacting effect
of copying collectors.

For lack of hard data, this is probably blowing hot air, tho!
But the point still holds: it'd probably be a good idea to provide ways to
tell the memory hierarchy what's going on (like a "trash" instruction to
indicate that some memory region can be trashed (to free the old-space) or a
destructive read (can be used for linear-programming, or for stack-pops)).

] The bottom line, in my view, is that if you're targeting a range of
] machines and can't pick your cache design, you should be very careful
] about locality of reference during allocation.  Allocate on the stack
] when you can do so safely, not on the heap---especially for very fast
] processors, where cache stalls start to be really important.

Too true: current systems are designed mainly to run stack-based programs, so
it's always safer to try to stick as much as possible to the same paradigm.
Appel's argument about GC and stack being just as fast (while the GC approach is
simpler since there *is* a GC anyway) is not convincing enough: the added
complexity buys you predictability (and it's a "one-time" complexity cost).


	Stefan
From: Paul Wilson
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3vb382$dtr@jive.cs.utexas.edu>
In article <··········@info.epfl.ch>,
Stefan Monnier <··············@epfl.ch> wrote:
>[comp.arch added since the thread is drifting towards cache design]

Then I should restate a couple of things:  I'm very much pro-GC, for software
engineering reasons, but I think that it ultimately costs you a little
bit in locality, relative to a similarly good explicit deallocator
(malloc/free kind of thing).

Unfortunately, comparisons are quite difficult because the state of
allocator research and development is a bit of a mess.  We don't know
much more than we did in 1971, which wasn't a whole lot.

This is important for computer architecture, because locality is 
strongly affected by the way storage is managed.  Programs reference
data objects, but data objects are mapped onto memory by the allocator
(and the memory hierarchy's own mechanisms).

Failure to realize this is one of the reasons we don't know much about
locality that we didn't know in 1971, either.  (When was the last time
you saw a cache evaluation paper that even mentioned what implementation
of malloc() was used?  Architects generally aren't aware of the potentially
large differences in locality for the same programs using different
allocators.)

>In article <··········@jive.cs.utexas.edu>,
>Paul Wilson <······@cs.utexas.edu> wrote:
>] In article <··················@aruba.apple.com>,
>] Ken Dickey <····@apple.com> wrote:
>] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
>] >>Another thing to note is that compacting GC's usually dramatically
>] >>reduce the overall working set of programs, which in turn reduces the
>] >>number of page faults a program might take.  ...
>
>Is there hard data on this one?

Essentially none.  Compaction is overrated, by the way.  It can do as much
damage as good, if it's not done *right*.  As far as comparisons between
conventional and GC'd systems go, there is essentially zero data on locality
effects, and there's every reason to think that locality effects are important.

The allocator literature is much worse than the GC literature in this respect.
Almost nothing is known about the behavior of allocators with real programs,
because the research mostly got off into a dead end of studying fragmentation
for synthetic programs, which are systematically and grossly unrealistic.

For more info on this, see the paper allocsrv.ps in our repository of papers
on memory management.  (ftp.cs.utexas.edu, in the directory pub/garbage.)
This is a long survey and literature review on memory allocators.  It makes
a few basic points about locality, as well as fragmentation and time costs.

>The only case where the GC was really able to reduce the working set size
>(at the VM level) is for the TI Explorer. I don't have the reference at
> hand,

My GC survey paper and memory management bibliography have citations for
this and some similar work.  (Also available in our ftp repository.)

> [...]

>] I don't think this is quite right.  GC usually costs you a little bit of 
>] locality, relative to a well-designed explicit allocator.  There's very
>
>I guess it depends on what you call locality. If you only consider the 
>concept (far from hardware problems), I can't see why: the set of objects
>used is the same and I can't see why the layout of objects has to be worse
>for a GC than for a manual allocator.

The basic layout issues are the same, I think.  A compacting GC has a lot
of freedom to reorganize data to improve locality or at least not to damage
it much.  But a conventional allocator also has huge freedom in its initial
placement of objects, and this can dramatically affect locality.  

(This has not been seriously studied, for no reason I can figure out---the
literature on conventional allocators is much less sophisticated about
locality than the literature on GC's.  It may be due in part to the fact
that GC people learned about locality the hard way, because early GC's
often had quite horrible locality.)

> Of course, actual hardware generally behaves worse with a GC
>because it doesn't quite understand the way a GC works. The better the cache
>can "understand" the GC, the better the locality.

I think this is partly right, but only partly.  GC's (except for reference
counting GC's) tend to do something that's intrinsically nasty for locality;
they pollute the cache with short-lived objects and can't *reuse* that memory
until a garbage collection detects that the memory is reusable.  You can limit
the damage this causes, but you can't eliminate it entirely.

>] little data on locality of allocators, however.  (Zorn's done some basic
>] stuff, but I don't regard it as conclusive.)  We hope to fix that sometime
>] soon.
>
>I do hope so too, but it's not gonna be easy: there are so many variants of GC
>and so many variants of memory hierarchies...

Yes.  On the other hand, I think one of the things that has limited progress
is a failure to focus on the fundamentals---namely program behavior and
what an allocator (or GC) can do to map that behavior nicely onto memory.
Architects tend to have a very naive view of locality, which is that it's
like a natural phenomenon that either is or isn't there.  In fact, it's a
product of patterned program behavior interacting with patterned allocator
(or GC) strategies, and these must be studied in relative isolation to
figure out what's really going on.  I believe that there are common and
simple patterns in program behavior that have not been noticed or exploited,
and that this is why there's been very little progress in the study of
allocators and the study of locality.

Our survey paper on allocators grinds this axe pretty hard---I think it's
an example of a very common failing in empirical computer science, which
is rushing to formalize complex things in simplistic ways, rather than to
focus on coming up with good *qualitative* models first.  This occurs in
architecture and OS work as well as in allocator research.  The current
understanding of locality is pretty pathetic---we still haven't gotten
much beyond LRU and Working Set and one-block-lookahead prefetching; 
those are all very superficial.

(Recent work on compiler-directed prefetching and similar stuff for database
queries is progress of a sort, but there's a huge gap between these very
specific tricks and the very vague generalities (e.g., LRU), which has not
been explored well at all.)

>] (Appel's new (FPCA '95) paper deals with the issue of interactions between
>] cache design and GC design.  It pretty much vindicates my much-attacked
>] LFP '92 paper.  As it turns out, everybody seems to be right---I'm right
>] for most architectures, but the "GC has good cache locality" camp (e.g.,
>] Reinhold; Tarditi, Diwan, and Moss; Appel's earlier work) is right if
>] you have a NICE cache.)
>
>This is the "understanding" problem: if the cache can notice that reading the
>line is not necessary when allocating a new object, it means the cache
>understands better the way the GC-allocator works. Your point about the
>memory bandwidth of the write-backs could be argued against by saying that
>there should be a way for the cache to notice that most of the write-backs
>aren't necessary (I'm assuming here that the generations' sizes have been
>carefully chosen so that they "fit the memory hierarchy" (whatever this
>really means in terms of relative sizes): in such a case most write-backs
>will be for "old-space" objects that have been recognized as dead already).
>Since there will still be write-backs of objects that *are* dead but haven't
>been recognized as such yet, I guess manual allocators still have an edge.

Yes.  That's my point.  An allocator that knows when objects die always
has a locality advantage over one that waits for a GC to tell it.  This
may not matter much for a particular GC in a particular memory hierarchy,
but explicit deallocation always has the edge.  In practice, there's a
limit to how small you can make the youngest generation of a GC, so for
small, fast caches you will have a lot of misses.  For larger caches,
you'll waste some space.

>But it might be balanced by the compacting effect
>of copying collectors.

I don't think so.  For small caches, by the time you've compacted data it's
too late---you touched too much memory in between GC's, because you couldn't
reuse memory immediately.

The memory lost to fragmentation in conventional allocators appears to be
widely overestimated---at least for the best allocators.  This is partly
because the allocator research area is such a mess that people use the
wrong allocators (e.g., next fit) or don't tweak them in the right ways
(e.g., getting rid of footer overhead for boundary tags).

Our recent results show good allocators having only about 10% fragmentation.
This is comparable to the losses you get just from a header word and rounding
object sizes up to a word or double word for architectural reasons.

(Some caveats are in order---this is for a set of eight varied C and C++
programs, but there are lots of kinds of programs that aren't represented
in *any* small set, and none of our programs runs for a very long time.)
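
To put that comparison in perspective, here's the kind of back-of-the-envelope
arithmetic I mean.  The header size and alignment are just typical values, not
measurements from the survey:

    #include <stdio.h>
    #include <stddef.h>

    #define HEADER_BYTES 4              /* boundary-tag header word (assumed) */
    #define ALIGNMENT    8              /* doubleword alignment (assumed) */

    /* Gross block size for a given request: header plus rounding up. */
    static size_t block_size(size_t request)
    {
        size_t gross = request + HEADER_BYTES;
        return (gross + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
    }

    int main(void)
    {
        size_t requests[] = { 12, 20, 28, 40, 100 };

        for (size_t i = 0; i < sizeof requests / sizeof requests[0]; i++) {
            size_t r = requests[i];
            size_t b = block_size(r);
            printf("request %3zu -> block %3zu  (overhead %4.1f%%)\n",
                   r, b, 100.0 * (double)(b - r) / (double)b);
        }
        return 0;
    }

For typical small objects that works out to overheads in the same general
range as the fragmentation figures, which is the comparison I'm making.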

It's not clear to me that compaction is a win, on average.  If objects
are placed correctly in the first place, it may not be necessary and it
may be *bad* for locality on average.  The allocation order and sizes
of objects are likely to provide very strong hints as to which objects
will be used together.  This info should be usable for decreasing
fragmentation and for increasing locality.

>For lack of hard data, this is probably blowing hot air, tho!
>But the point still holds: it'd probably be a good idea to provide ways to
>tell the memory hierarchy what's going on (like a "trash" instruction to
>indicate that some memory region can be trashed (to free the old-space) or a
>destructive read (can be used for linear-programming, or for stack-pops)).

I think cache-controlling instructions are a good idea, but I don't think
we're ready to say which ones are worthwhile.  We haven't exploited the
potential of allocator design to improve locality for *all* architectures,
and we haven't figured out what cache hints are most useful for the most
programs.

>] The bottom line, in my view, is that if you're targeting a range of
>] machines and can't pick your cache design, you should be very careful
>] about locality of reference during allocation.  Allocate on the stack
>] when you can do so safely, not on the heap---especially for very fast
>] processors, where cache stalls start to be really important.
>
>Too true: current systems are designed mainly to run stack-based programs, so
>it's always safer to try to stick as much as possible to the same paradigm.
>Appel's argument about GC and stack being just as fast (while the GC approach 
>is simpler since there *is* a GC anyway) is not convincing enough: the added
>complexity buys you predictability (and it's a "one-time" complexity cost).
>

I think the difference between good GC'd systems and good explicit allocators
is often overstated, partly due to the tendency to see locality in
simplistic terms.  GC'd and non-GC'd systems may use memory in pretty
similar ways, if both are well-designed.

Consider a FORTRAN program that uses large arrays (bigger than the cache)
repeatedly.  Often, such programs go through arrays periodically, filling
them with new values and grinding on them for a little while.

Now consider a GC'd system that romps through memory allocating mostly
short-lived objects, then GC's and does it again.

Notice that at a high level of abstraction, these seemingly different
programs are doing the same thing---they're reusing memory to hold
short-lived data.  In the case of the FORTRAN array, it's the programmer
who decides that memory will be reused, by repeatedly using the same
language-level object to hold different data sets.  In the case of a
GC'd system, it's the garbage collector that decides, by mapping different
language-level objects onto the same range of addresses over time.
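
In code, the two idioms look something like this (C rather than FORTRAN for
brevity; the sizes and helper functions are made up):

    #include <stdlib.h>

    #define N (256 * 1024)                  /* bigger than the cache */

    static void fill(double *a, size_t n)   /* produce the next data set */
    {
        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;
    }

    static void grind(double *a, size_t n)  /* use it for a while */
    {
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] * 2.0 + 1.0;
    }

    void reuse_style(int phases)            /* the programmer reuses the array */
    {
        static double work[N];
        for (int p = 0; p < phases; p++) {
            fill(work, N);
            grind(work, N);
        }
    }

    void fresh_style(int phases)            /* the allocator/GC decides on reuse */
    {
        for (int p = 0; p < phases; p++) {
            double *work = malloc(N * sizeof *work);
            if (work == NULL)
                return;
            fill(work, N);
            grind(work, N);
            free(work);     /* in a GC'd system this happens later, at a GC */
        }
    }

Either way, the same amount of data gets written each phase; the difference is
who decides which addresses get reused, and when.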

From the cache's point of view, however, it doesn't really matter.  A
dumb cache will stall a lot, because it will miss frequently
as areas of memory are romped through.

A prefetching cache will do better, by fetching the data ahead of time.
(It will tend to be bandwidth-limited, however, if the processor is fast.)
A write-allocate cache with one-word subblock placement will do better,
because it will notice that everything is written before it is read,
and optimize away the fetching of blocks---it will create them out of
thin air in the cache, rather than fetching data that will only be
overwritten immediately anyway.  In all of these cases, there will be
a lot of write backs, and very fast processors will be bandwidth-limited.

A really good GC or really good FORTRAN compiler may be able to exploit
these patterns, if the cache is controllable, by noticing when things
become "dead" and telling the cache not to bother with most of the
write-backs.  The situations where they'll be able to do this are different,
though.

To figure out what computer architects ought to do, we need to have a
much better understanding of how programming idioms interact with compilers'
and allocators mappings of objects onto memory.

As an allocator and GC designer, it seems to me that the best bet is to 
figure out what's really going on, exploit what I can by controlling object
placement, and hope that it turns out that the same features that are
useful to me are also useful to C and FORTRAN people, because that's where
a lot of the market is.  (I'm pretty optimistic, though, because I suspect
a lot of the same behaviors show up in different kinds of systems, just
in different guises.)  In the short term, I think it's important to optimize
for the kinds of cache designs that are or will be widely used in the
short term---there's plenty of worthwhile work to be done there.

-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      
From: Frank Adrian
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3vbcd5$stn@atheria.europa.com>
Paul Wilson (······@cs.utexas.edu) wrote:
: To figure out what computer architects ought to do, we need to have a
: much better understanding of how programming idioms interact with compilers'
                                               ^^^^^^
: and allocators mappings of objects onto memory.

I must say that I agree with your statement, although you misspelled the
seventh word on the second line of this paragraph.  The "m" should be a
"t" instead.

I also apologize for injecting this note of levity into what was an
extremely well-written (and quite accurate) post.  Although, when I
read this paragraph, I just couldn't resist :-).
___________________________________________________________________________
Frank A. Adrian           ancar technology            Object Analysis,
······@europa.com         PO Box 1624                   Design,
                          Portland, OR 97207              Implementation,
                          Voice: (503) 281-0724             and Training...
                          FAX: (503) 335-8976
From: Mike McDonald
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3vbl70$bht@fido.asd.sgi.com>
  I should preface my comments with a statement stating I know next to
nothing about modern allocators and GCs. The last time I studied any GC in
detail was for a graduate compiler construction class at UIUC in 83/84.
After I presented my report on Henry Baker's generational (I think) GC, the
professor stated that GC had nothing to do with compilers. I wimped out and
let it slide instead of challenging him.  That said,

In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul
Wilson) writes:
|> In article <··········@info.epfl.ch>,
|> Stefan Monnier <··············@epfl.ch> wrote:
|> >[comp.arch added since the thread is drifting towards cache design]
|> 
|> Then I should restate a couple of things:  I'm very much pro-GC, for
|> software engineering reasons, but I think that it ultimately costs you a
|> little bit in locality, relative to a similarly good explicit deallocator
|> (malloc/free kind of thing).

  I too am in favor of GCs from a software engineering point of view. I've
spent too much of my life already worrying about malloc/free in both my
programs and in others.


|> It's not clear to me that compaction is a win, on average.  If objects
|> are placed correctly in the first place, it may not be necessary and it
|> may be *bad* for locality on average.  The allocation order and sizes
|> of objects are likely to provide very strong hints as to which objects
|> will be used together.  This info should be usable for decreasing
|> fragmentation and for increasing locality.

  I don't think allocation order and size gets you very far. First,
allocation order tells you nothing about usage order. The order of
allocation is very dependent on the algorithms used and the style of the
writer. Changing the order of a couple of C lines can drastically change
the order of allocation. As an example, let's say we want to build a tree
structure. How many orderings can you come up with to build that tree? To
keep it simple, let's just say depth first vs breadth first. These two
algorithms give completely different allocation orders but result in the
same logical structure. Allocation order gives us the wrong information.
What we really want is access order. In this example, the algorithm used
to walk the tree is far more important. After all, you only allocate
something once. You may access it many times.
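
Concretely, something like this (rough C; error handling and freeing the tree
are glossed over):

    #include <stdlib.h>

    /* The same complete binary tree built two ways.  The logical structure
       is identical, but the order in which nodes come out of the allocator
       (and hence where they probably land in memory) is entirely different. */

    struct node {
        int id;
        struct node *left, *right;
    };

    static struct node *new_node(int id)
    {
        struct node *n = calloc(1, sizeof *n);
        if (n == NULL)
            exit(1);                        /* keep the sketch simple */
        n->id = id;
        return n;
    }

    /* Depth first: a node's entire left subtree is allocated before any of
       its right subtree. */
    struct node *build_dfs(int depth, int *next_id)
    {
        if (depth == 0)
            return NULL;
        struct node *n = new_node((*next_id)++);
        n->left  = build_dfs(depth - 1, next_id);
        n->right = build_dfs(depth - 1, next_id);
        return n;
    }

    /* Breadth first: every node on one level is allocated before any node
       on the next level. */
    struct node *build_bfs(int depth)
    {
        struct node *queue[4096];           /* enough for depth <= 12 */
        int head = 0, tail = 0, id = 0;
        struct node *root;

        if (depth == 0)
            return NULL;
        root = new_node(id++);
        queue[tail++] = root;
        for (int level = 1; level < depth; level++) {
            int level_end = tail;
            while (head < level_end) {
                struct node *n = queue[head++];
                n->left  = queue[tail++] = new_node(id++);
                n->right = queue[tail++] = new_node(id++);
            }
        }
        return root;
    }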

  Size doesn't give us much useful info either. Just because two
allocation requests ask for the same amount of memory doesn't mean they are
the same structure nor that they'll be accessed together. If you get two
requests for 32 bytes right in a row, what does that tell you? I'd say darn
little. That 32 byte object could be a string of length 32, an array of 8
pointers, or a 3d vector (x,y,z,w) of doubles. The other 32 byte object
could be something completely different.

  I'd think a much more useful piece of information that the compiler
could provide is which structures are going to point to this piece of
memory the allocator is about to allocate. After all, you can't access
something until you access the location of the pointer to that memory.

  The type of the object being allocated would also be nice. Then the
compiler could build up some usage tables based upon static analysis. The
allocator could then use that info along with the "who points to me" info
to decide where to place the objects relative to each other.

  For instance, if I have a type A that includes pointers to objects of
type B and type C, and the compiler determines that every time I access an
object of type A there's an 80% chance that I'll access the B object and a
30% chance I'll access the C object, then when I allocate an object of type
A, B, or C, I can do some interesting things. For instance, when I allocate
an object of type A, I may want to reserve the space for the type B object
too but not the type C object.  Anytime later when I actually go to allocate
the type B object, I can return the previously reserved space since I'm also
given the pointer to the initial A object.

  Mike McDonald
  ·······@engr.sgi.com
From: Paul Wilson
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <3vbrrf$ege@jive.cs.utexas.edu>
In article <··········@fido.asd.sgi.com>,
Mike McDonald <·······@engr.sgi.com> wrote:
>  I should preface my comments with a statement stating I know next to
>nothing about modern allocators and GCs. The last time I studied any GC in
>detail was for a graduate compiler construction class at UIUC in 83/84.
>After I presented my report on Henry Baker's generational (I think) GC, the
>professor stated that GC had nothing to do with compilers.

Your professor was a bozo.  Unless he was somebody important, of course :-)

>I wimped out and let it slide instead of challenging him. 

Probably a good strategy under the circumstances, given that he was a bozo. :-)

Seriously, though, I think this is a major problem.  CS is too 
compartmentalized, and there are "compiler courses" most places which give
very short shrift to runtime systems.  (They also tend to focus on the
best-understood and hence most boring parts of compiling---e.g., spending
the first 2/3 of the course on parsing, most of the rest on intermediate
code generation and assembly, and the last day on optimization.  Not all
compiler courses, of course, but many, especially those taught at lesser
schools by older faculty.  Maybe it's changing, but not fast enough for my
tastes.)

>That said,
>In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul
>Wilson) writes:
>|> In article <··········@info.epfl.ch>,
>|> Stefan Monnier <··············@epfl.ch> wrote:
>|> >[comp.arch added since the thread is drifting towards cache design]
>|> [ ... ]
>
> [...]
>
>|> It's not clear to me that compaction is a win, on average.  If
>|> objects are placed correctly in the first place, it may not be 
>|> necessary and it may be *bad* for locality on average.  The
>|> allocation order and sizes of objects are likely to provide very
>|> strong hints as to which objects will be used together.  This info
>|> should be usable for decreasing fragmentation and for increasing locality.
>
>  I don't think allocation order and size gets you very far. First,
>allocation order tells you nothing about usage order.

First, let me say that I oversimplified a bit.  The exact order isn't
what I'm mostly talking about, and isn't usually the most important for
memory hierarchies.

Also, in the best of all possible worlds, more information and more freedom
to use it are always better.  In the real world as it is, I think a lot
of the obvious information available to allocators goes unused, and that
allocators could be improved a lot.  Ultimately, funky memory hierarchy
designs and snazzy adaptive reorganization of the data may be a win,
but we're not exploiting the hardware and software structures we've got now.

It's also not clear that reorganizing things in an objectwise way in a
GC is the right way to do things.  It may well be that the right way
of doing things is mostly independent of whether you have a GC or
an explicitly managed (malloc/free) heap.

>The order of allocation is very dependent on the algorithms used and the
>style of the writer.

To some degree yes, and to some degree no.  Many differences don't really
matter much, and many similar behaviors arise in very different systems,
though maybe in different combinations and different proportions.  We
need to understand those things before we decide what the appropriate way
of exploiting them is.

(Our allocator survey grinds this axe pretty hard, too.  I talk explicitly
about a strategy/policy/mechanism distinction, as opposed to the usual
policy/mechanism distinction.  I think that's important, because the
strategies are not understood very well, much less the policies that
are supposed to implement the strategies.  The strategies should be
chosen on the basis of a deep understanding of regularities and irregularities
in workloads, which is almost completely absent in the literature on both
allocators and locality of reference.  I probably should have made a four-way
distinction: theory/strategy/policy/mechanism, where I mean "theory" in
the vernacular sense of "I think this is basically what's going on", not
in the CS "theory" sense of "I have a cute formalization that may or
may not have anything to do with reality.")

To be more concrete, I think that phase behavior is quite important to
locality.  Objects created at *roughly* the same time are likely to be
used for similar purposes, but maybe not if they are of different *types*.
If they are of different types, they are likely to be used for different
purposes---e.g., some may hold intermediate values of a computation, while
others hold results that outlive the phase that creates them.

Assuming we don't know the types of objects being allocated, as conventional
allocators don't, it's reasonable to use size heuristically to discriminate
between different types of objects created at about the same time.
It's not as good as real type information, but it'll probably get most of
the same benefit.  (In my personal opinion, that is.  Soon I may have data
that say whether I'm right.)

> Changing the order of a couple of C lines can drastically change the order
> of allocation.

Quite true.  On the other hand, it may not matter.  For example, a small
change in code may *reverse* the order in which a list is created or
traversed, but that may not matter much for locality.  As long as the
items in the list are clustered together in roughly sequential order,
you'll get a ton of spatial locality.  If they're interspersed with
other kinds of objects created by different phases, squirreled away here 
and there in holes in the heap, you're likely to damage locality.

See our allocator survey for more on this, and profiles of phase behavior
of real programs that suggest that this view is reasonable.

>an example, let's say we want to build a tree structure. How many orderings
>can you come up with to build that tree? To keep it simple, let's just say 
>depth first vs breadth first. These two algorithms give completely different
>allocation orders but result in the same logical structure.

Up to a point, I agree.  On the other hand, there are several arguments
(and even a little data) that go the other way.  Breadth-first and depth-first
actually both have some nice locality properties, and either may result in
pretty good clustering IF you keep the tree nodes belonging to a given tree
together in memory, rather than interspersing them with other data.  You
may want to segregate the data indexed by the tree separate from the
nodes of the tree itself (segregation by size will usually do this).

> Allocation order gives us the wrong information. What we really want is 
>access order. In this example, the algorithm used to walk the tree is far
>more important.

Yes and no.  If the tree is reasonably well localized---by any reasonable
heuristic---you may get most of the benefit of an order specialized for
a particular traversal.  For a small tree, this may mean the whole tree
fits in fast memory, where interspersing different kinds of data would
increase its "footprint".  For a larger tree, you want something that
clusters the upper nodes together, mostly, and clusters the data it
indexes reasonably well, too.  In this case, you'll do well for lots
of different access patterns.  Sparse searches of the tree will take
a minimal number of faults, and exhaustive traversals of the tree will
do okay with only a few pages of memory.

A lot of this can be controlled in software, by the choice of the tree
structure, rather than by a GC or fancy hardware.  For example, in the
Texas persistent store for C++, we use B+ trees rather than binary trees.
A B+ tree can be viewed as a binary tree where most of the levels have
been collapsed to cluster strongly-related nodes within a page.  I think
a lot of data structure work is just optimizing the wrong thing---locality
is increasingly important, and we should be optimizing for that as well
as raw CPU speed.
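
Just to illustrate the layout difference (the fanout is made up, and a real
B+ tree treats its leaves and internal nodes a bit differently):

    #include <stddef.h>

    struct bin_node {               /* binary tree: one key per heap object */
        long key;
        struct bin_node *left, *right;
    };

    #define FANOUT 128              /* sized so a node fills a page, or at
                                       least a good run of cache blocks */

    struct bplus_node {             /* B+-tree style: keys clustered together */
        int  nkeys;
        long keys[FANOUT - 1];
        struct bplus_node *children[FANOUT];
    };

One fetch of a bplus_node brings in several levels' worth of search decisions;
the equivalent bin_nodes could be scattered across as many pages as there are
nodes on the search path.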

The real booger for this theory is trees that are created in random order.
I don't think that happens very often, though, or at least it won't if
people are careful to code their programs right.

For one thing, most data simply aren't unpatterned.  The patterns may be
unknown, but they're NOT random.  Many seemingly different kinds of patterns
have similar locality implications, as long as they're not truly random.
(Yet another axe we grind in the allocator survey.)

>After all, you only allocate something once. You may access it many times.

Yes, but I claim (without proof) that the phase structure of a program
usually gives strong hints as to later access patterns---maybe not the
exact order of accesses, but enough information to cluster objects reasonably
and generate spatial locality.

>  Size doesn't give us much useful info either. Just because two
>allocation requests ask for the same amount of memory doesn't mean they
>are the same structure nor that they'll be accessed together.

True, but see the above.  Size alone isn't enough, but size and ordering
together can tell you a lot.

>If you get two requests for 32 bytes right in a row, what does that tell
>you? I'd say darn little. 

I'd bet big money you're wrong. :-)  Of course, that's unfair---I've been
researching this kind of thing for years, and you're raising very good
questions.  (In some sense, I am betting big money on this.)

>That 32 byte object could be a string of length 32, an array of 8 pointers,
>or a 3d vector (x,y,z,w) of doubles.  The other 32 byte object could be
>something completely different.

It *could* be, but it usually isn't, and that's good enough for real
systems.  The majority of the time, two consecutive requests for the
same size *are* to allocate the same type of object, and for a similar
*purpose*, because of the phase structure in most programs.

>
>  I'd think a much more useful piece of information that the compiler
>could provide is which structures are going to point to this piece of
>memory the allocator is about to allocate.

I don't think so.  Compilers are good at very local analyses, but they're
usually pretty bad at the kinds of things that matter to locality.
(An exception to this is compiler-directed prefetching.)  A compiler
may be able to easily look 100 CPU cycles into the future, but less
often 1000, and seldom a million.  So you may be able to prefetch the
next few things you need, but the compiler generally won't be able to
tell you how to cluster things for good locality at larger granularities.
(Another exception is simple algorithms over very large arrays.  Compilers
for scientific codes can get that right sometimes.)

> After all, you can't access something until you access the location of
> the pointer to that memory.

Yes, but by then it's too late.  This is the locality problem of heap-allocated
data---because of the arbitrary indirections, looking a few instructions
ahead can only tell you what happens a few instructions ahead, not the 1,000
or 10,000 you'd really like---or the millions you'd like in the case of
virtual memory systems, to do a disk seek ahead of time.  For virtual
memories, there's generally not enough "overlap" to do any good, which is
why virtual memory systems typically only do sequential demand prefetching,
i.e., wait until a fault and grab the next thing, too.

Clustering is the flip side of prefetching, and often much easier to
implement on dumb hardware.

Clustering is where it's at, in my view---you want the right data grouped
together so that a dumb fetch of a block or page will grab a lot of useful
data.  It doesn't have to be data that you'll touch in the next 100
instruction cycles.  If you use it within 10,000 instruction cycles, it
will usually be good.  (For virtual memories, if you use it within the
next *billion* instruction cycles it will usually be good;  compilers
are very bad at dealing with patterns on that scale, but runtime systems
can be very good at it.)

The key thing here is that (pre)fetching doesn't have to make *good*
predictions of the exact order of references to do an excellent job.
If it makes *passable* (significantly better than random) predictions,
it often works excellently.

>  The type of the object being allocated would also be nice.

Yes.  I agree completely---but I think that most systems don't take
advantage of the information that's staring them in the face.  Order
and size can be exploited much more than current systems do, and there
are other sources of information that don't require fancy system structures
or specialized hardware.

>Then the compiler could build up some usage tables based upon static
>analysis.

You have to be very careful about this.  This is a trap a lot of people
fall into.  Just counting the frequencies of usage is often misleading.
It's the ways in which things are used---particularly large-scale
patterns---that are most relevant to memory hierarchies.

This is one of the problems we get because of the overemphasis on
compilers and the almost complete deemphasis of runtime systems.  Runtime
systems have a very different kind of information available, and for
some purposes, it's *better* information.

(It may sound like I'm arguing for dynamically adaptive runtime systems,
but I think that's a trap, too---it's often better to take advantage
of known regularities without literally being "adaptive" at runtime.
People often design "adaptive" algorithms that are wrong in subtle
ways, and don't know why they don't work.  There are often simple
regularities that can be exploited without any fancy pattern recognition
and adaptation---see our allocator survey for an example.)

>The allocator could then use that info along with the "who points to me"
>info to decide where to place the objects relative to each other. 

I think this is a reasonable idea, but not the easiest to implement
fruitfully.  (It's been tried, without much success, although that's
often not an indicator of the true merit of an idea.)

>  For instance, if I have a type A that includes pointers to objects of
>type B and type C, and the compiler determines that every time I access an
>object of type A there's an 80% chance that I'll access the B object and
>a 30% chance I'll access the C object, then when I allocate an object of
>type A, B, or C, I can do some interesting things. For instance, when I
>allocate an object of type A, I may want to reserve the space for the
>type B object too but not the type C object.  Anytime later when I actually
>go to allocate the type B object, I can return the previously reserved 
>space since I'm also given the pointer to the initial A object.

This is a very appealing and seductive idea, but I think it's wrong.
(Many researchers in locality have been seduced by this kind of reasoning,
so don't feel bad :-).)

You have to take scale into account.  In locality, it usually doesn't
matter how many times you do things in the very short term---what matters
is *whether* you do things at all on a somewhat longer timescale.  (E.g.,
thousands, millions, or billions of instruction cycles.)

This is analogous to the issue of frequency-based vs. recency-based
replacement.  A whole lot of people find frequency-based replacement
very appealing, but it seldom works as well as simple LRU replacement.

LFU replacement evicts the least frequently used item, where LRU replacement
evicts the least recently used.  LRU is much better, because it's
"timescale relative"---it only notices touches that matter on the
timescale that's relevant to the memory in question.

Consider the following memory access patterns for two different pages:

*      *      *      *      *      *      *      *      *      *       *      *

******               *******                *******                ********

LRU wins on this because it will usually evict the BOTTOM page, and have
to fault it back into memory three times.  LFU loses, because it usually
evicts the top page---which is touched fewer times---and has to fault
it back in 11 times.

The reason why LRU is (very roughly) right and why LFU is (usually) wrong
is that short-term access patterns don't matter nearly as much as medium-
and long-term patterns.   When I fault a page or block into memory, it
doesn't matter whether I touch it twice, or 200 times in quick
succession---what matters is whether I fault on it at all.  Subsequent
short-term access patterns are irrelevant because the item will be in
memory for a little while for any reasonable policy, and repeated short-term
touches just don't matter.  (The last one may matter, because it affects
the LRU ordering and hence which block will be evicted first, but all
of the intervening short-term touches not only don't matter much, they
don't matter *at*all*.)
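
In pseudo-C, the difference between the two policies is just which victim they
pick (a sketch; the real bookkeeping is done very differently, of course):

    #include <stddef.h>

    struct page {
        long last_use;      /* timestamp of the most recent touch */
        long use_count;     /* total touches since it was loaded */
    };

    /* LRU evicts the page that has gone untouched the longest. */
    size_t lru_victim(const struct page *p, size_t n)
    {
        size_t victim = 0;
        for (size_t i = 1; i < n; i++)
            if (p[i].last_use < p[victim].last_use)
                victim = i;
        return victim;
    }

    /* LFU evicts the page that has been touched the fewest times. */
    size_t lfu_victim(const struct page *p, size_t n)
    {
        size_t victim = 0;
        for (size_t i = 1; i < n; i++)
            if (p[i].use_count < p[victim].use_count)
                victim = i;
        return victim;
    }

On the access patterns above, the burst-touched page piles up a big use count,
so LFU keeps it and keeps throwing out the steadily-but-rarely touched one;
LRU only looks at who was touched least recently, which is what matters.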

This illustrates the problem of expecting the compiler to figure out
locality for you---a compiler has a short-term view, and locality is
a "big picture" issue.   The same kind of problem comes up when you
rely on the compiler to tell you how often certain things happen---the
compiler can tell you how often certain concrete events are likely
to happen, but it can't tell you which of those things *matter* on
the timescale that matters.

>  Mike McDonald
>  ·······@engr.sgi.com


-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      
From: Paul Wilson
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <3vcne8$etl@jive.cs.utexas.edu>
In article <··········@fido.asd.sgi.com>,
Mike McDonald <·······@engr.sgi.com> wrote:
>In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul
>Wilson) writes:
>|> It may well be that the right way
>|> of doing things is mostly independent of whether you have a GC or
>|> an explicitly managed (malloc/free) heap.
>
>  Are we talking about one implementation of malloc/free used for all
>programs or are we talking about customized malloc/free on a per program
>basis?

Mostly the former, but the latter as necessary, and I agree it's likely
to be necessary sometimes.  I suspect it won't be necessary often, though,
because I think one allocator can combine strategies so that it works
well for a variety of different program behaviors.

>Trying to
>come up with a general purpose allocator/GC is a quite different problem
>than coming up with a methodology for generating "custom" allocators/GC
>that use program-specific knowledge. I'm gathering you're referring to the
>former.

Yes.  Custom-generated allocators are probably a good idea for a minority
of programs, but I think we've only scratched the surface of what a
really good allocator can do for most programs.  An intermediate strategy
is to have a covering set of two or three good allocators, all of which
usually do well, and at least one of which does really well for almost
any program.  Then you can just link against different allocators and see
what works best for your application.

This is fairly common in the sense that some people pick between a very
fast allocator---e.g., Chris Kingsley's BSD 4.2 allocator---and a slower
but more space-efficient allocator.  But that sort of brutal space/time
tradeoff just shouldn't be necessary.  There are some fairly
simple techniques that should make almost any reasonable allocator
reasonably fast.  With deferred coalescing, you can probably get most
of the speed advantage of a very simple allocator, and most of the
space advantage of a more sophisticated one.
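
The idea, very roughly: freed blocks of popular sizes go on cheap per-size
"quick lists" and get handed right back, and only when a list gets long do
you push its blocks back to the general allocator for real freeing and
coalescing.  Something like this sketch, where malloc/free stand in for the
slower coalescing allocator and the thresholds are arbitrary:

    #include <stdlib.h>

    /* Deferred coalescing.  Assumes the underlying allocator returns blocks
       at least as big as the rounded-up request, and at least pointer-sized,
       so a freed block can hold the list link. */

    #define NUM_CLASSES   32            /* quick lists for small sizes only */
    #define FLUSH_TRIGGER 64            /* arbitrary list-length threshold */

    struct qnode { struct qnode *next; };

    static struct qnode *quick_list[NUM_CLASSES];
    static int           quick_len[NUM_CLASSES];

    static int size_class(size_t size)  /* 8-byte buckets; -1 means too big */
    {
        size_t c = (size + 7) / 8;
        return c < NUM_CLASSES ? (int)c : -1;
    }

    void *quick_alloc(size_t size)
    {
        int c = size_class(size);
        if (c >= 0 && quick_list[c] != NULL) {  /* fast path: reuse a recent free */
            struct qnode *n = quick_list[c];
            quick_list[c] = n->next;
            quick_len[c]--;
            return n;
        }
        return malloc(size);                    /* slow path: general allocator */
    }

    void quick_free(void *p, size_t size)
    {
        int c = size_class(size);
        if (c < 0) {
            free(p);
            return;
        }
        struct qnode *n = p;
        n->next = quick_list[c];
        quick_list[c] = n;
        if (++quick_len[c] > FLUSH_TRIGGER) {   /* the deferred part: flush */
            while (quick_list[c] != NULL) {
                struct qnode *dead = quick_list[c];
                quick_list[c] = dead->next;
                free(dead);                     /* real coalescing would go here */
            }
            quick_len[c] = 0;
        }
    }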

>I'm interested in the latter since people are always "comparing" GC to "hand
>optimized" allocation. (It's my experience that very few of the people
>who claim that they can't use a GC due to its high overhead compared to
>the "hand optimized" allocation actually do the optimization. They just 
>use malloc instead.)

I agree that this is true in many cases.  But some people really do use
hand-tweaked allocation (often bypassing malloc entirely);  they use ad
hoc allocators they kruft up for a particular situation.  Sometimes this
is a good idea, but I think that most of the time it's not---people
should just find a better general-purpose allocator and use that.  This
is especially common in C++, where people go nuts overloading new and
making class-specific allocators for a single kind of object, which often
gets rid of internal fragmentation but *creates* external fragmentation.
Yuck.

(Not that overloading new is a bad idea in general---we do it in our
persistent store for C++ and in our C++ interface to our real-time GC.
But don't try it at home---it's for trained professionals on a closed
course :-), and it shouldn't be necessary just to get decent allocator
performance.  Especially since it tends to conflict horribly with
better-motivated reasons for overloading new, such as persistence and
GC.)

Hand-krufted or specialized allocators (like obstacks) do very dangerous
things to the structure of programs, and can lead to *very* weird bugs.
(I've seen some class-specific "pool" allocators that don't work for
subclasses of that class because the sizes are different---aargh!  There
goes your nice extensible program design...)  If there were better general
allocators around---and appreciated---it wouldn't be so attractive to
monkey with the works.

(Aside to people who do this sort of thing: there ARE some pretty good
allocators out there, such as Doug Lea's segregated fits allocator, which
is FREE.  If you don't like the allocator you've got, try a different
one.  If it doesn't run on your system (e.g., Lea's runs on UNIXES),
PORT IT---it's not that hard.  I hope that soon we'll have a new and
nifty allocator to give away, too, but probably not for a few months.)

>|> >The order of allocation is very dependent on the algorithms used and
>|> >the style of the writer.
>|> 
>|> To some degree yes, and to some degree no.  Many differences don't
>|> really matter much, and many similar behaviors arise in very different
>|> systems, though maybe in different combinations and different proportions. 
>|> We need to understand those things before we decide what the appropriate
>|> way of exploiting them is.
>
>  True. I'd also say the corollary "Similar algorithms can have vastly
>different allocation behaviors" is also true.

Yes, but I think a single allocator (or two or three at most) can cover
most of the important behaviors in real programs.  I could be wrong.  We'll
see.

But again, a lot of the time the differences don't make a difference.  If
you can cluster related data---and not intersperse them with unrelated
data---you generally win.  It doesn't matter as much what the exact
access patterns are, or which clusters are accessed more often, as long
as birds of a feather flock together.  (That is, you can often win by
discriminating between things that are accessed differently, rather than
by predicting how they'll be accessed.)  And I think that, by and large,
you can use phased creations of objects as a fair correlate of phased
accesses to those objects, and get a win there too.

This is the basic idea behind "zone" allocators, which allow the programmer
to specify which objects should be lumped together in some subset of memory,
and that works for many programs.  On the other hand, I think that a smart
allocator should be able to do that discrimination pretty well, automatically.

(Some zone allocators have problems with external fragmentation, because
free memory can't be shared between zones.  A smart allocator should be
able to mostly segregate unrelated objects, but still share large hunks
of free memory---e.g., whole free pages---between the different subheaps.)
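
The programmer-directed version is basically this (a toy sketch; the page
pool, alignment, freeing, and the names are all made up or glossed over):

    #include <stddef.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* Each zone carves small objects out of its own pages, so unrelated
       objects don't get interleaved.  Pages come from one shared pool
       (here just malloc), which is what would let whole free pages move
       between zones. */

    struct zone {
        char  *page;        /* the page currently being carved up */
        size_t used;        /* bytes used in that page */
    };

    void *zone_alloc(struct zone *z, size_t size)
    {
        size = (size + 7) & ~(size_t)7;             /* keep alignment */
        if (size > PAGE_SIZE)
            return NULL;                            /* big objects handled elsewhere */
        if (z->page == NULL || z->used + size > PAGE_SIZE) {
            z->page = malloc(PAGE_SIZE);            /* grab a page from the pool */
            if (z->page == NULL)
                return NULL;
            z->used = 0;
        }
        void *obj = z->page + z->used;
        z->used += size;
        return obj;
    }

Tree nodes go in one zone, the records they index in another, and birds of a
feather end up on the same pages.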

>|>  [...]
>|> To be more concrete, I think that phase behavior is quite important
>|> to locality.  Objects created at *roughly* the same time are likely to
>|> be used for similar purposes, but maybe not if they are of different
>|> *types*.
>|> If they are of different types, they are likely to be used for
>|> different
>|> purposes---e.g., some may hold intermediate values of a computation,
>|> while
>|> others hold results that outlive the phase that creates them.
>
>  I'm assuming by "phase behaviour" and "phase structure" you're meaning
>that programs tend to go thru distinct steps, such as the initialization
>phase, the update phase, ...

That's part of it, but those larger phases often have many smaller phases
with distinctive usage of just a few sizes of objects.  Phase structure
occurs at a variety of scales, and exploiting small phases could be
a big win.

> (I've printed out your survey papers. I just
>haven't read them yet today.)

Given the page counts, you're forgiven.

>|> Assuming we don't know the types of objects being allocated, as
>|> conventional allocators don't, it's reasonable to use size 
>|> heuristically to discriminate between different types of objects 
>|> created at about the same time.
>|> It's not as good as real type information, but it'll probably get
>|> most of the same benefit.  (In my personal opinion, that is.  Soon
>|> I may have data that say whether I'm right.)
>
>  Why don't the allocators have the type information? The compilers
>usually do.

It's mostly because of the dumb way that allocators are usually linked
with applications, which has some nice advantages.  Type info IS pretty
easy to get, and may be useful.  I'm just grinding the axe that current
allocators don't exploit (or don't *intentionally* exploit) a lot of
important information they're already getting.  I think we can get most
of the big wins without changing that situation, but if need be, we
can use compiler cooperation.

On the other hand, there are other sources of info that seem to me to
be more promising, such as the call site and the stack height.  These
are better clues as to what the program is up to.  Barrett and Zorn
have done some work on this kind of thing, as have Demers, Boehm,
Hayes et al. (of Xerox PARC) in GC'd systems.  I think these are good
ideas, but not really necessary and maybe a little premature, since we
haven't explored the information normal allocators are given, through
the normal interface---which I think is a gold mine.
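
Getting call-site information through the normal interface is cheap, by the
way.  Something like this, where site_alloc() and the hash-by-site idea are
just made up for illustration:

    #include <stddef.h>
    #include <stdlib.h>

    /* Pass the call site along with the size.  A smarter allocator might
       hash (file, line) to choose a subheap, so allocations from the same
       site get clustered; this one just falls through to malloc. */
    void *site_alloc(size_t size, const char *file, int line)
    {
        (void)file;
        (void)line;
        return malloc(size);
    }

    #define SITE_ALLOC(size) site_alloc((size), __FILE__, __LINE__)

A call like SITE_ALLOC(sizeof (struct foo)) then tells the allocator which
piece of code is asking, without any compiler changes.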

>I don't think the allocator needs to know the exact structure of the
>objects but knowing that this call is allocating the same type as that 
>previous call seems like it would be useful and easy to implement. Just 
>have the compiler assign a unique id to each type and pass that id along
>to the allocator.

Yes, this is easy.  I just think it's unnecessary, because successive
allocations of the same size usually are allocations of the same type
of object, even if other phases of the program allocate different objects
of the same sizes.  And if you're going to discriminate further, you may
want to use call site (or call chain) info instead.
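
To illustrate the kind of interface being suggested, here's a sketch.  (The
names and table sizes are invented, and a real allocator would control where
the blocks are placed rather than just falling through to malloc; the point is
only the shape of the interface.)

#include <stdlib.h>

#define NTAGS    64                /* e.g. one per compiler-assigned type id */
#define MAXWORDS 32                /* size classes: 1..32 doublewords        */

/* one free list per (tag, size class); unrelated tags never share a list */
static void *free_heads[NTAGS][MAXWORDS + 1];

static unsigned size_class(size_t n)       /* round up to 8-byte doublewords */
{
    return (unsigned)((n + 7) / 8);
}

void *alloc_tagged(size_t n, unsigned tag)
{
    unsigned c = size_class(n);
    if (c == 0 || c > MAXWORDS || tag >= NTAGS)
        return malloc(n);                  /* fall back for odd requests     */
    if (free_heads[tag][c]) {              /* reuse a block freed with the   */
        void *p = free_heads[tag][c];      /* same tag and size class        */
        free_heads[tag][c] = *(void **)p;
        return p;
    }
    return malloc(c * 8);
}

void free_tagged(void *p, size_t n, unsigned tag)
{
    unsigned c = size_class(n);
    if (p == NULL) return;
    if (c == 0 || c > MAXWORDS || tag >= NTAGS) { free(p); return; }
    *(void **)p = free_heads[tag][c];      /* push back onto the tag's list  */
    free_heads[tag][c] = p;
}

The same table indexed by a hashed call site instead of a compiler-assigned
tag gives you the call-site variant.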

>
>|> >an example, let's say we want to build a tree structure. How many
>|> >orderings can you come up with to build that tree? To keep it simple, 
>|> >let's just say depth first vs breadth first. These two algorithms 
>|> >give completely different allocation orders but result in the same
>|> >logical structure.
>|> 
>|> Up to a point, I agree.  On the other hand, there are several
>|> arguments (and even a little data) that go the other way.  
>|> Breadth-first and depth-first actually both have some nice locality
>|> properties, and either may result in pretty good clustering IF you
>|> keep the tree nodes belonging to a given tree together in memory, 
>|> rather than interspersing them with other data.  You may want to 
>|> segregate the data indexed by the tree separate from the
>|> nodes of the tree itself (segregation by size will usually do this).
>
>  I guess I must write weird programs. Most of my data structures tend
>to be about the same size. After the compiler gets done rounding the
>sizes off to double word alignment, I'd guess that there's a large 
>redundancy in the sizes.

I don't think so, from the data I've been looking at.  (I don't have hard
numbers on this at the moment---or at least not in the right form yet.)
I'm pretty sure that consecutive objects of the same size are usually
of the same type, due to skews in size usage during small-scale phases.
(Have a look at some of the memory-use profiles in our allocator survey.)

Also, in my understanding, it's usually not the compiler that does the
rounding (although the compiler does padding)---doubleword alignment is
usually enforced by the allocator;  at least some of the information is
usually passed along through the allocator interface, although padding
may filter out some of it.
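
For instance, a typical malloc does something like this to every request,
independently of whatever padding the compiler already did (a minimal sketch;
the header size and alignment are just the usual choices):

#include <stddef.h>

#define ALIGN 8                         /* doubleword                       */
#define HDR   sizeof(size_t)            /* boundary tag kept before a block */

/* what the allocator really hands out for a request of n bytes */
static size_t rounded_request(size_t n)
{
    return (n + HDR + (ALIGN - 1)) & ~(size_t)(ALIGN - 1);
}

So a 20-byte and a 24-byte object both end up in the same 32-byte class, which
is one reason the sizes the allocator sees are lumpier than the sizes the
programmer wrote down.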

>|> > Allocation order gives us the wrong information. What we really
>|> > want is access order. In this example, the algorithm used to walk
>|> > the tree is far more important.
>|> 
>|> Yes and no.  If the tree is reasonably well localized---by any
>|> reasonable heuristic---you may get most of the benefit of an order
>|> specialized for a particular traversal.  For a small tree, this may 
>|> mean the whole tree fits in fast memory, where interspersing different 
>|> kinds of data would increase its "footprint".  For a larger tree,
>|> you want something that clusters the upper nodes together, mostly,
>|> and clusters the data it indexes reasonably well, too.  In this case, 
>|> you'll do well for lots of different access patterns.  Sparse searches 
>|> of the tree will take a minimal number of faults, and exhaustive
>|> traversals of the tree will do okay with only a few pages of memory.
>
>  If you can identify which allocs belong to the nodes and which belong
>to the data attached to the node, and if while accessing the nodes of the 
>tree you don't need to access the data items, then I'll agree that just
>about any way you walk the tree has about the same locality "niceness"
>as any other. I just don't believe that size and order give you enough
>information to make that fine of a separation.

Okay, I'll try to come up with some data about small-scale phase behavior
that's more convincing.

>|> A lot of this can be controlled in software, by the choice of the
>|> tree structure, rather than by a GC or fancy hardware.  For example,
>|> in the Texas persistent store for C++, we use B+ trees rather than 
>|> binary trees.  A B+ tree can be viewed as a binary tree where most
>|> of the levels have been collapsed to cluster strongly-related nodes 
>|> within a page.  I think a lot of data structure work is just optimizing
>|> the wrong thing---locality is increasingly important, and we should
>|> be optimizing for that as well as raw CPU speed.
>
>  I totally agree. Memory bandwidth between the various components is
>almost always a bottleneck. Except benchmarks of course. Most of those
>fit in everyone's on-chip cache.
>
>|> The real booger for this theory is trees that are created in random
>|> order.  I don't think that happens very often, though, or at least
>|> it won't if people are careful to code their programs right.
>
>  Wait a minute! Now we're relying on the programmers to do something
>smart? If that were the case, they'd all be writing their own allocators
>by hand for each program instead of just using malloc. But they don't.
>They want something that's easy to use and that will do what they perceive
>as a reasonable job without them having to think about it.

I agree that this sounds like I'm talking out of both sides of my mouth,
which I am, to some degree.  But I think data structure designers ultimately
have to take the fall for some things.  There are some things that are just
easier to fix in the data structures than in the compiler and/or runtime
system.  Too much data structure work is still targeting early 1960's
architectures---no VM, no cache.

It's time data structure and algorithm designers moved into the real world.
Libraries should provide alternative implementations of basic collection
types with different *locality* characteristics for large collections. 
It's just way easier for a data structure designer to get this right in
the first place than for a compiler and/or runtime system and/or hardware
to recover information that was thrown away before a line of code got
written.  In a sense, a b-tree is just a more natural structure for a
large hierarchical memory than a binary tree is.  Of course, for mostly
in-memory use you need to pay more attention to optimizing searches
within a node than database people usually do.
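
To make the contrast concrete, compare the two node layouts below (a sketch
only; FANOUT is picked so that a node is roughly one 4KB page, assuming 8-byte
keys and pointers):

struct bin_node {                  /* classic binary tree: one key per object */
    long key;
    struct bin_node *left, *right; /* each step of a search may touch a       */
};                                 /* different cache line or page            */

#define FANOUT 254                 /* ~254 keys + 255 children ~= 4KB         */

struct wide_node {                 /* B+-tree-style node: many keys per object */
    int  nkeys;
    long key[FANOUT];                     /* one page-in pulls in hundreds of  */
    struct wide_node *child[FANOUT + 1];  /* keys, so the tree is only a few   */
};                                        /* pages deep for millions of keys   */

Searching within the wide node costs a few more compares than a pointer chase,
but every one of those compares hits memory that the first one already brought
in.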

>|> For one thing, most data simply aren't unpatterned.  The patterns may
>|> be unknown, but they're NOT random.  Many seemingly different kinds of
>|> patterns have similar locality implications, as long as they're not truly
>|> random. (Yet another axe we grind in the allocator survey.)
>
>  I think I agree with this. Now, can one method do a reasonable job for
>all of these patterns or do we need a different method for each pattern?

Almost the former, I think---we should be able to deal with most common
patterns pretty well in one allocator, but we may need a couple of backup
allocators that handle patterns that are uncommon but not ignorably
rare, and we may have to resort to customized allocators for really
ornery programs.

>I believe that later is true. If so, how do you tell what the pattern
>is before you encounter it? If you leave it all up to the allocator, it
>has to deduce the pattern from one call at a time and then adjust its
>behavior accordingly. But it's too late then since it has already allocated
>things. I think it needs hints before it starts.

Maybe yes, maybe no.  I think that by and large, a fairly simple combination
of fairly simple strategies is likely to work nearly as well for most
programs as a custom strategy.  The trick is to identify the best and most
robust strategies and combine them in a way such that they don't undermine
each other.  This doesn't have to be explicitly adaptive---it may "just
work out" that one feature comes into play in a more important way in
some circumstances, and another in different circumstances.  I have a
hunch (which you're quite reasonable to be skeptical of) that the good
strategies aren't generally diametrically opposed, and can be combined
reasonably well.

(Also, adaptive tricks may be helpful sometimes---fragmentation is
sometimes a cumulative problem due to small-scale phase behavior.
An allocator could conceivably notice the small-scale phases and limit
the problem before too many of the troublesome phases occur.  I'm hoping
we don't have to do anything that hairy.  Adaptive algorithms tend not
to be very robust, unless you *really* know what you're doing.  It's
easy to adapt to one thing and shoot yourself in the foot with respect
to another.)

As an analogy, consider LRU replacement policies.  LRU is a simple policy,
but one that "adapts" reasonably well to a very wide variety of memory
referencing patterns exhibited by real programs.  You can beat it in
any number of special cases, but on average it's really pretty good.
(A lot better than anybody had a right to expect, given the simplistic
notions of locality that were around when it was proposed.)  People
were quite skeptical that LRU was robust enough for real use, but it
turns out that there are some deep reasons why it's very robust.
(At least, compared to FIFO, RANDOM, LFU, and a bunch of other things
that were tried.)
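
For concreteness, here is LRU boiled down to a toy (the frame count and the
reference trace are made up, and a real VM approximates this rather than
walking a list on every reference):

#include <stdio.h>

#define FRAMES 4

static int resident[FRAMES];   /* page numbers, most recently used first */
static int nres = 0;
static int faults = 0;

static void touch(int page)
{
    int i, j;
    for (i = 0; i < nres && resident[i] != page; i++)
        ;
    if (i == nres) {                       /* miss: fetch, maybe evict LRU  */
        faults++;
        if (nres < FRAMES) nres++;
        i = nres - 1;                      /* overwrite the LRU slot        */
    }
    for (j = i; j > 0; j--)                /* slide the others down one     */
        resident[j] = resident[j - 1];
    resident[0] = page;                    /* touched page becomes the MRU  */
}

int main(void)
{
    int trace[] = { 1, 2, 3, 1, 2, 4, 1, 2, 5, 1, 2, 3 };
    int n = (int)(sizeof trace / sizeof *trace);
    int k;
    for (k = 0; k < n; k++)
        touch(trace[k]);
    printf("%d faults on %d references with %d frames\n", faults, n, FRAMES);
    return 0;
}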

I think that there will be allocator policies that are similarly robust,
and for some reasons that are similar at a very high level of abstraction:

  1. Programs aren't out to get you;  they're artifacts designed to
     solve problems, not to foil a memory allocator.  As long as you
     don't do things that lose big, you'll usually do pretty well.
     The bad cases don't appear to occur very often, if you don't
     do something stupid and shoot yourself in the foot.

  2. Programs exhibit locality in their allocation behavior, as well
     as in their memory-referencing behavior.  They are patterned in
     ways that should be exploitable---we've already got allocators
     that work pretty well for most programs, and there's no reason
     to think that they're the best we can do.  (The process that
     resulted in the allocators we've got was not very scientific or
     empirical.  I have to guess that when we figure out what's really
     going on, we can do at least a little bit better.)

  3. In terms of locality of reference, at least, it's more important
     not to lose big on any common patterns than it is to exploit
     every bit of information to the fullest.  This has to do with
     the granularity issues discussed in my last posting, and is why
     LRU almost always works better than LFU.  You don't have to make
     very precise predictions (and LRU doesn't) as long as you can
     make reasonably good discriminations.  (LRU does this by not
     screwing up small-scale stuff: it NEVER evicts something that's
     very recently touched;  its mistakes tend to be in the longer-term
     patterns, which occur much less often and therefore don't pose
     as much of a danger, on average.)

I know this is all pretty vague;  you'll have to wait if you want convincing
proof.

>|> >After all, you only allocate something once. You may access it many
>|> times.
>|> 
>|> Yes, but I claim (without proof) that the phase structure of a
>|> program usually gives strong hints as to later access patterns---maybe
>|> not the exact order of accesses, but enough information to cluster 
>|> objects reasonably and generate spatial locality.
>
>  I guess I disagree here. From my experience, in a lot of systems,
>different phases were written by completely different people, sometimes
>completely different organizations. To assume that they'll have similar
>locality behaviour is too much for me.

I can only say that from the limited data I've seen, and my own intuitions,
I think that a fairly simple combination of strategies can exploit a
wide variety of patterns.  As I said earlier, this is partly because
some "different" kinds of patterns can be exploited by a single simple
strategy, and because I don't think the good strategies will turn out
to be diametrically opposed.

>|> Clustering is where it's at, in my view---you want the right data
>|> grouped together so that a dumb fetch of a block or page will grab
>|> a lot of useful data.  It doesn't have to be data that you'll touch
>|> in the next 100 instruction cycles.  If you use it within 10,000
>|> instruction cycles, it will usually be good.  (For virtual memories,
>|> if you use it within the next *billion* instruction cycles it will 
>|> usually be good;  compilers are very bad at dealing with patterns on
>|> that scale, but runtime systems can be very good at it.)
>
>  Within the next 10,000 of your instruction cycles? Maybe. Within a
>billion? No way. Never.

I'm not sure you realize how much faster CPU's are than disks---most
people don't, and it's kind of amazing when you think about it.

Consider a 100 MIPS processor with 32 MB of RAM and a 10 msec average
disk access time (seek plus rotational latency plus transfer time).
This is what I've got on my desk.  If I have 4KB pages and fetch one
page at a time at 10msec each, it takes a while to fault in a memory
full of pages and flush all of the previously resident pages.  (For
simplicity, assume I do something nasty like straight-line accesses
of a huge amount of memory.)  This is roughly the timescale relevant
to virtual memory replacement policies.

At 10msec per page, I can fetch 100 pages in a second.  32MB of 4KB
pages is about 8000 pages, or 80 seconds replacement time, *minimum*.
That is, a page will stay in memory for at least 80 seconds after
it is last touched, even if I'm touching a different nonresident
page at every memory reference, i.e., if I have no usable locality and
am thrashing my brains out.

With an 80-second replacement time, at 100 million instructions a
second, that's 8 billion instructions between the last time I touch
a page and the time it gets evicted.
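
If you want to check the arithmetic, here it is spelled out, with the same
parameters assumed above:

#include <stdio.h>

int main(void)
{
    double mips        = 100e6;            /* instructions per second       */
    double ram_bytes   = 32.0 * 1024 * 1024;
    double page_bytes  = 4096.0;
    double fault_secs  = 0.010;            /* seek + rotation + transfer    */

    double pages        = ram_bytes / page_bytes;   /* ~8192                */
    double flush_secs   = pages * fault_secs;       /* ~82 seconds          */
    double instructions = flush_secs * mips;        /* ~8e9                 */

    printf("%.0f resident pages, %.0f s to replace them all,\n"
           "about %.1e instructions between last touch and eviction\n",
           pages, flush_secs, instructions);
    return 0;
}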

Now, this is all oversimplified.  If I'm really doing straight-line
accesses or effective prefetching, I may be limited only by my SCSI
bandwidth, and be able to get something close to 4MB/sec.  In that
case, I'm under a billion instructions.  And I may lose some
memory to nonpageable OS stuff, so maybe I only get half a billion.  

On the other hand, the gap between CPU and disk speeds is widening;
there are processors that are several times faster than mine, but
few disks that are that many times faster than my disk.  (RAIDs
help, bandwidth-wise, but if a significant fraction of my pageins
require seeks, I'm hosed.)  And most of the time, there's some locality,
so you don't tend to flush everything as fast as possible.  So billions
of instructions is about the right ballpark, for fast machines in the
next few years.

The bottom line is that for VM prefetching to be very effective at all,
I've got to make predictions at least a hundred thousand instructions
in advance.  On the other hand, it's just fine to predict things many
millions of instructions in advance, because bringing them in early
won't waste memory for very long on the scale of VM replacement. 

This implies that clustering is more attractive than true prefetching,
because if I can group together things that are touched at very roughly
the same time (e.g., within a few seconds of each other), I win.  I
don't have to make very short-term predictions, and I don't have to
predict very precisely.

> All of those other programs running with bad locality are going to
>use that LRU algorithm to flush your data back out to disk. 

Nope, the disk can't keep up, and the processes will just be blocked
most of the time.

>Heck, you'd probably flush them out yourself.

Again, the disk just won't go that fast---I can't flush pages any faster
than I can fault pages in.

>I'm assuming you are doing something in that billion cycles.

Not usually---these days I (and many people) usually only have one
dominant process that uses most of the memory, and if you're paging
much at all, you're thrashing.  There are lots of machines that
are more time-shared, but they often have more RAM as well, and the
trend is toward fewer and fewer big processes per processor.

>|> The key thing here is that (pre)fetching doesn't have to make *good*
>|> predictions of the exact order of references to do an excellent job.
>|> If it makes *passable* (significantly better than random)
>|> predictions, it often works excellently.
>|> 
>|> >  The type of the object being allocated would also be nice.
>|> 
>|> Yes.  I agree completely---but I think that most systems don't take
>|> advantage of the information that's staring them in the face.  Order
>|> and size can be exploited much more than current systems do, and
>|> there are other sources of information that don't require fancy system
>|> structures or specialized hardware.
>
>  I don't think type does either. (See above.)

Agreed.  Type info isn't really hard to get.  I just don't think it's
much better than the info allocators are getting already;  if it turns
out to be, I'll certainly be happy to use it.
>
>|> This is one of the problems we get because of the overemphasis on
>|> compilers and the almost complete deemphasis of runtime systems. 
>|> Runtime
>|> systems have a very different kind of information available, and for
>|> some purposes, it's *better* information.
>
>  True. But the problem with runtime info is that you've already
>committed yourself before the patterns are apparent. I think local, static
>analysis can help here. True, it doesn't help with global issues. But
>once again, some knowledge is often better than none. (It's also often worse.)
>

In principle, I think I agree.  It's just that in practice, we should be
able to do a good job with the information at hand, once we understand
its significance.  Later we can go looking for more information to get
further improvements.  You have to be careful that gathering and
exploiting the information isn't more expensive than it's worth, which
is why I'm biased towards simple solutions--but based on sophisticated
knowledge about why they work.  That's where the hard work comes in...


-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      
From: Mike McDonald
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <3vjb55$sge@fido.asd.sgi.com>
In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul Wilson) writes:

|> Yes.  Custom-generated allocators are probably a good idea for a minority
|> of programs, but I think we've only scratched the surface of what a
|> really good allocator can do for most programs.  An intermediate strategy
|> is to have a covering set of two or three good allocators, all of which
|> usually do well, and at least one of which does really well for almost
|> any program.  Then you can just link against different allocators and see
|> what works best for your application.

  In my experience, the system usually only comes with one. If you want to try
something different, you're on your own. That tends to lead people to build their
own from scratch rather than experiment with using already proven ones. I think
this is a pretty sad state of affairs.


|> >I'm interested in the latter since people are always "comparing" GC to "hand
|> >optimized" allocation. (It's my experience that very few of the people
|> >who claim that they can't use a GC due to its high overhead compared to
|> >the "hand optimized" allocation actually do the optimization. They just 
|> >use malloc instead.)
|> 
|> I agree that this is true in many cases.  But some people really do use
|> hand-tweaked allocation (often bypassing malloc entirely);  they use ad
|> hoc allocators they kruft up for a particular situation.  Sometimes this
|> is a good idea, but I think that most of the time it's not---people
|> should just find a better general-purpose allocator and use that.  This
|> is especially common in C++, where people go nuts overloading new and
|> making class-specific allocators for a single kind of object, which often
|> gets rid of internal fragmentation but *creates* external fragmentation.
|> Yuck.

  A good number of the "hand optimized" allocators I've seen people create
actually perform worse than the standard malloc that came with their system. I
think a lot of this is due to the fact that we all think we know how our code
behaves but we've never bothered to verify it. As a result, we build upon
assumptions that aren't true, as your paper points out in the case of random
traces.

|> Hand-krufted or specialized allocators (like obstacks) do very dangerous
|> things to the structure of programs, and can lead to *very* weird bugs.
|> (I've seen some class-specific "pool" allocators that don't work for
|> subclasses of that class because the sizes are different---aargh!  There
|> goes your nice extensible program design...)  If there were better general
|> allocators around---and appreciated---it wouldn't be so attractive to
|> monkey with the works.

  This was when I got called in. They built a kludgy system that was unreliable
and they needed someone to "fix" the bug. They usually didn't like the answer
that their whole system was the "bug". :-)

|> (Aside to people who do this sort of thing: there ARE some pretty good
|> allocators out there, such as Doug Lea's segregated fits allocator, which
|> is FREE.  If you don't like the allocator you've got, try a different
|> one.  If it doesn't run on your system (e.g., Lea's runs on UNIXES),
|> PORT IT---it's not that hard.  I hope that soon we'll have a new and
|> nifty allocator to give away, too, but probably not for a few months.)

  The state of affairs in software "engineering" just plain sucks! (I can't think
of any nicer way to put it.) People go off and study basic problems, putting
lots of time and effort into characterising the problem and coming up with
solutions, and to what effect? Does anyone use their results? Has anyone heard of
their work? Nope, we all just reinvent everything from scratch every time. Pitiful.

|> >  True. I'd also say the corollary "Similar algorithms can have vastly
|> >different allocation behaviors" is also true.
|> 
|> Yes, but I think a single allocator (or two or three at most) can cover
|> most of the important behaviors in real programs.  I could be wrong.  We'll
|> see.

  I'll agree to that for now.

|> > (I've printed out your survey papers. I just
|> >haven't read them yet today.)
|> 
|> Given the page counts, you're forgiven.

  I've just about finished reading it. (Except for pages 48 thru 56. The bottom
half of each page didn't get printed for some reason. I'll have to reprint them.)

  I'm a bit confused by one thing. In the beginning of the paper, you state that
reducing fragmentation is the most important thing for an allocator to do. In this
thread, I get the impression that you now think locality is more important. Care
to elaborate?

|> It's mostly because of the dumb way that allocators are usually linked
|> with applications, which has some nice advantages.  Type info IS pretty
|> easy to get, and may be useful.  I'm just grinding the axe that current
|> allocators don't exploit (or don't *intentionally* exploit) a lot of
|> important information they're already getting.  I think we can get most
|> of the big wins without changing that situation, but if need be, we
|> can use compiler cooperation.

  I'll agree that we're not using the info we have to the fullest. But if type
info is easy to get, why not see if it's helpful while we're at it? [I know. One
has to sleep once in a while too. :-)]

|> On the other hand, there are other sources of info that seem to me to
|> be more promising, such as the call site and the stack height.  These
|> are better clues as to what the program is up to.  Barrett and Zorn
|> have done some work on this kind of thing, as have Demers, Boehm,
|> Hayes et al. (of Xerox PARC) in GC'd systems.  I think these are good
|> ideas, but not really necessary and maybe a little premature, since we
|> haven't explored the information normal allocators are given, through
|> the normal interface---which I think is a gold mine.

  I found these ideas intriguing! I take it that the point of these is that if
the allocator is called from the same location on two different occasions, it's
probably allocating the same type of object (or a "closely" related one). (Your
paper never explicitly states this.)


|> Yes, this is easy.  I just think it's unnecessary, because successive
|> allocations of the same size usually are allocations of the same type
|> of object, even if other phases of the program allocate different objects
|> of the same sizes.  And if you're going to discriminate further, you may
|> want to use call site (or call chain) info instead.

  But two successive calls of different sizes may also be of the same type. A call
to allocate one foo looks quite different to malloc than a call to allocate
five of them. But the types are the same, with, I'd guess, similar death times.


|> Also, in my understanding, it's usually not the compiler that does the
|> rounding (although the compiler does padding)---doubleword alignment is
|> usually enforced by the allocator;  at least some of the information is
|> usually passed along through the allocator interface, although padding
|> may filter out some of it.

  Who requires the alignment varies. The allocator may enforce a larger alignment
policy in order to make its job easier. The compiler may enforce an alignment
policy for architectural reasons. Or it may be a combination of both. At a
minimum, the allocator had better honor whatever the compiler and libraries require.


|> I agree that this sounds like I'm talking out of both sides of my mouth,
|> which I am, to some degree.  But I think data structure designers ultimately
|> have to take the fall for some things.  There are some things that are just
|> easier to fix in the data structures than in the compiler and/or runtime
|> system.  Too much data structure work is still targeting early 1960's
|> architectures---no VM, no cache.
|> 
|> It's time data structure and algorithm designers moved into the real world.
|> Libraries should provide alternative implementations of basic collection
|> types with different *locality* characteristics for large collections. 
|> It's just way easier for a data structure designer to get this right in
|> the first place than for a compiler and/or runtime system and/or hardware
|> to recover information that was thrown away before a line of code got
|> written.  In a sense, a b-tree is just a more natural structure for a
|> large hierarchical memory than a binary tree is.  Of course, for mostly
|> in-memory use you need to pay more attention to optimizing searches
|> within a node than database people usually do.

|> I think that there will be allocator policies that are similarly robust,
|> and for some reasons that are similar at a very high level of abstraction:
|> 
|>   1. Programs aren't out to get you;  they're artifacts designed to
|>      solve problems, not to foil a memory allocator.  As long as you
|>      don't do things that lose big, you'll usually do pretty well.
|>      The bad cases don't appear to occur very often, if you don't
|>      do something stupid and shoot yourself in the foot.

  As your paper points out, we really don't know what the "good" things are to
exploit, nor do we know what the "bad" things are to avoid. We know a few bad
things and a few things that seem to work well, but we really don't know
why. Basically, we've been lucky.

|>   2. Programs exhibit locality in their allocation behavior, as well
|>      as in their memory-referencing behavior.  They are patterned in
|>      ways that should be exploitable---we've already got allocators
|>      that work pretty well for most programs, and there's no reason
|>      to think that they're the best we can do.  (The process that
|>      resulted in the allocators we've got was not very scientific or
|>      empirical.  I have to guess that when we figure out what's really
|>      going on, we can do at least a little bit better.)
|> 
|>   3. In terms of locality of reference, at least, it's more important
|>      not to lose big on any common patterns than it is to exploit
|>      every bit of information to the fullest.  This has to do with
|>      the granularity issues discussed in my last posting, and is why
|>      LRU almost always works better than LFU.  You don't have to make
|>      very precise predictions (and LRU doesn't) as long as you can
|>      make reasonably good discriminations.  (LRU does this by not
|>      screwing up small-scale stuff: it NEVER evicts something that's
|>      very recently touched;  its mistakes tend to be in the longer-term
|>      patterns, which occur much less often and therefore don't pose
|>      as much of a danger, on average.)

  Ah, the MinMax philosophy. Don't try to win, just try to keep from losing. (And
hope the other guy goofs up.) This philosophy never has appealed to me. I do
admit though that it usually does work.


|> >|> Clustering is where it's at, in my view---you want the right data
|> >|> grouped together so that a dumb fetch of a block or page will grab
|> >|> a lot of useful data.  It doesn't have to be data that you'll touch
|> >|> in the next 100 instruction cycles.  If you use it within 10,000
|> >|> instruction cycles, it will usually be good.  (For virtual memories,
|> >|> if you use it within the next *billion* instruction cycles it will 
|> >|> usually be good;  compilers are very bad at dealing with patterns on
|> >|> that scale, but runtime systems can be very good at it.)
|> >
|> >  Within the next 10,000 of your instruction cycles? Maybe. Within a
|> >billion? No way. Never.
|> 
|> I'm not sure you realize how much faster CPU's are than disks---most
|> people don't, and it's kind of amazing when you think about it.

  A LOT! But we solve most people's problem by making them buy enough RAM to hold
their dataset. To my mind, the more important ratios are secondary cache to
main memory, and primary cache to secondary cache. I think a good allocator
should worry not about making sure that the working set fits in main memory
but about making it fit in the caches. Related things should get allocated on the
same 4Kb page. I think you might need more info than you can currently get from
the malloc interface for this.

|> On the other hand, the gap between CPU and disk speeds is widening;

  The gap between the CPU and memory (whether cache or main memory) is also
widening at the same rate. So an on-chip cache miss is bad. Missing the secondary
cache is a disaster.

|> This implies that clustering is more attractive than true prefetching,
|> because if I can group together things that are touched at very roughly
|> the same time (e.g., within a few seconds of each other), I win.  I
|> don't have to make very short-term predictions, and I don't have to
|> predict very precisely.

  But not just clustering into main memory. You have to get the items to cluster
properly in the caches too. (I take it from your current interest in locality
that you're aware of this problem.)


|> >  True. But the problem with runtime info is that you've already
|> >committed yourself before the patterns are apparent. I think local, static
|> >analysis can help here. True, it doesn't help with global issues. But
|> >once again, some knowledge is often better than none. (It's also often worse.)
|> >
|> 
|> In principle, I think I agree.  It's just that in practice, we should be
|> able to do a good job with the information at hand, once we understand
|> its significance.  Later we can go looking for more information to get
|> further improvements.  You have to be careful that gathering and
|> exploiting the information isn't more expensive than it's worth, which
|> is why I'm biased towards simple solutions--but based on sophisticated
|> knowledge about why they work.  That's where the hard work comes in...

  I totally agree.

  Mike McDonald
  ·······@engr.sgi.com
From: Henry Baker
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-3107951036070001@192.0.2.1>
In article <··········@fido.asd.sgi.com>, ·······@engr.sgi.com (Mike
McDonald) wrote:

>   Yep. According to this professor, compiler construction is only about
> parsing
> with error recovery. Maybe a little pinhole optimization too.
                                      ^^^^^^^

Make that 'pinhead' optimization [something that Zippy would do, e.g.].

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Paul Johnson
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40ceal$oh9@miranda.gmrc.gecm.com>
Paul Wilson (······@cs.utexas.edu) wrote:

> To be more concrete, I think that phase behavior is quite important to
> locality.  Objects created at *roughly* the same time are likely to be
> used for similar purposes, but maybe not if they are of different *types*.
> If they are of different types, they are likely to be used for different
> purposes---e.g., some may hold intermediate values of a computation, while
> others hold results that outlive the phase that creates them.

> Assuming we don't know the types of objects being allocated, as conventional
> allocators don't, it's reasonable to use size heuristically to discriminate
> between different types of objects created at about the same time.

How about looking at the call stack?  You can identify the allocate
call by looking at your return address.  Then localise objects
according to that.  You could extend this scheme by hashing the last
few return addresses together to get some idea of the context.

I have sometimes wondered about an instrumented GC which records
allocation and reference patterns in order to find out which
allocations are used to create objects which link to which other
objects, so that they can be grouped together.  This data could be
collected for a number of test runs and then used to generate
optimisation tables for the production version.  I don't think that
this could use the return addresses to identify things (since that
will change between runs), but the generated code could maintain
something similar.

Does this make sense?

> It's not as good as real type information, but it'll probably get most of
> the same benefit.  (In my personal opinion, that is.  Soon I may have data
> that say whether I'm right.)

Identifying the allocation context will probably give better value.
Consider a linked list class which uses chains of cons-cells.  It is
best if each discrete list is co-located, rather than putting all the
cons-cells into a single huge pool.  Context-based allocation would be
able to do this.


Paul.

--
Paul Johnson            | GEC-Marconi Ltd is not responsible for my opinions. |
+44 1245 473331 ext 2245+-----------+-----------------------------------------+
Work: <············@gmrc.gecm.com>  | You are lost in a twisty maze of little
Home: <····@treetop.demon.co.uk>    | standards, all different.
From: Stefan Monnier
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40dbfo$fgp@info.epfl.ch>
In article <··········@miranda.gmrc.gecm.com>,
Paul Johnson <···@gmrc.gecm.com> wrote:
] How about looking at the call stack?  You can identify the allocate
] call by looking at your return address.  Then localise objects
] according to that.  You could extend this scheme by hashing the last
] few return addresses together to get some idea of the context.

ignoring the fact that looking at the stack is not portable, there is another
problem: how do you plan on making the thing fast enough?
(improving locality is good, but you don't want your allocator to take ten times
as much time, do you?)


	Stefan
From: Erik Naggum
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <19950811T135001Z@naggum.no>
[Stefan Monnier]

|   ignoring the fact that looking at the stack is not portable,

memory allocation is never portable.  I just realized that this may be why
C++ is so bad in this area, and why C++ programmers write their own
allocators all the time.  non-portable things should be isolated and have
geniuses optimize the hell out of them.

|   there is another problem: how to you plan on making the thing fast
|   enough ?  (improving locality is good, but you don't want your
|   allocator to take ten times as much time, do you?)

someone said that one page fault costs on the scale of 10,000,000
instructions on today's CPU, memory, and disk speeds.  if you avoid a page
fault, you have time to spare.  later, you can optimize the savings...

here's one statistic: I have helped a math grad student write a C program
that does some horrible analysis on millions of points in fluid mechanics.
she insisted upon an access pattern that thrashed the machine she ran it on,
so I wrote a custom matrix allocator that matched her access pattern, and
runtime fell by 70%.  bonus: this allocator is also *much* faster than
your regular 5000 malloc calls to set up the indirection vector into the
matrix, since it allocates all of the memory in one fell swoop.  it could
of course be greatly improved with a language that let me specify how to
access the memory, but worrying about the allocator does pay off at times.
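
roughly what that allocator does, sketched in C (alloc_matrix is a made-up
name; it assumes pointers and doubles pack without alignment trouble, and you
free the whole matrix with a single free):

#include <stdlib.h>

double **alloc_matrix(size_t rows, size_t cols)
{
    double **m = malloc(rows * sizeof *m + rows * cols * sizeof **m);
    double  *data;
    size_t   i;
    if (m == NULL)
        return NULL;
    data = (double *)(m + rows);      /* row data starts right after the    */
    for (i = 0; i < rows; i++)        /* indirection vector                 */
        m[i] = data + i * cols;       /* m[i][j] works as usual             */
    return m;                         /* one malloc, one free(m)            */
}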

#<Erik 3017137801>
-- 
#<Xyzzy 202B370B73>
From: Peter Holzer
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40n8j5$68p@wsrcom.wsr.ac.at>
"Stefan Monnier" <··············@epfl.ch> writes:

>In article <··········@miranda.gmrc.gecm.com>,
>Paul Johnson <···@gmrc.gecm.com> wrote:
>] How about looking at the call stack?  You can identify the allocate
>] call by looking at your return address.  Then localise objects
>] according to that.  You could extend this scheme by hashing the last
>] few return addresses together to get some idea of the context.

>ignoring the fact that looking at the stack is not portable, there is another
>problem: how do you plan on making the thing fast enough?
>(improving locality is good, but you don't want your allocator to take ten times
>as much time, do you?)

Doesn't have to take much time. It is probably sufficient to use the
top 3 or 4 return addresses (1 almost certainly isn't enough. Too many
programs use a wrapper around malloc), each of which means 2 memory
accesses (close together, probably in the same cache line) and an
addition. Plus the overhead of the hash function of course. If it saves
1 page fault per 10000 mallocs it has repaid its cost several times.
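
Something along these lines, as a sketch (GCC-specific; site_malloc and the
sizes are invented, only the immediate return address is hashed here, there is
no free path, and walking several frames up is exactly the non-portable part):

#include <stdlib.h>
#include <stdint.h>

#define NBINS 64
#define ARENA 4096

struct arena { char *next, *end; };
static struct arena arenas[NBINS];     /* one bump arena per call-site bin  */

static unsigned site_bin(void *ret)
{
    uintptr_t a = (uintptr_t)ret;
    return (unsigned)((a >> 4) ^ (a >> 12)) % NBINS;   /* cheap hash        */
}

void *site_malloc(size_t n)
{
    struct arena *a = &arenas[site_bin(__builtin_return_address(0))];
    n = (n + 7) & ~(size_t)7;                          /* doubleword align  */
    if (a->next == NULL || n > (size_t)(a->end - a->next)) {
        size_t sz = n > ARENA ? n : ARENA;
        char *mem = malloc(sz);               /* the old arena's remainder  */
        if (mem == NULL) return NULL;         /* is abandoned in this sketch */
        a->next = mem;  a->end = mem + sz;
    }
    a->next += n;                             /* callers from the same site */
    return a->next - n;                       /* end up side by side        */
}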

	hp

--
   _  | Peter Holzer | Systemadministrator WSR | ···@wsr.ac.at
|_|_) |-------------------------------------------------------
| |   |  ...and it's finished!  It only has to be written.
__/   |         -- Karl Lehenbauer
From: Stefan Monnier
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40np1e$1fp@info.epfl.ch>
In article <··········@wsrcom.wsr.ac.at>,
Peter Holzer <···@wsrdb.wsr.ac.at> wrote:
] Doesn't have to take much time. It is probably sufficient to use the
] top 3 or 4 return addresses (1 almost certainly isn't enough. Too many
] programs use a wrapper around malloc), each of which means 2 memory
] accesses (close together, probably in the same cache line) and an
] addition. Plus the overhead of the hash function of course. If it saves
] 1 page fault per 10000 mallocs it has repaid its cost several times.

First thing, unless your program really eats up much memory, this is not likely
to save any page faults. (Of course, you might disagree, based on what you
consider a typical program, but the point still holds: most programs' working
sets fit in memory. Saving memory is never bad, but in the usual unix view of
"one user, one workstation, one active program", the user won't notice the
improvement.)

Anyway, the speed of malloc *is* important because of the impact it has on
programmers. If malloc is considered slow (as is often the case), people will
continue to write custom routines and to avoid dynamic allocation (of course,
they also avoid it because of the burden implied by the lack of GC).
A fast malloc consists of very few cycles (basically: find the free list
corresponding to the required size and pop an element). Your 3 or 4 lookups (the
first will be trivial, but the others might be more involved since they depend on
the frame size) plus the hash-table access will probably mean that malloc will be
at least 3 times slower (and I consider 3 very optimistic). It's definitely not
negligible, and it had definitely better improve locality noticeably to be
worth the bother!
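
(For reference, the fast path I mean is roughly this; fast_malloc, fast_free
and slow_alloc are invented names, and the refill path is just a stand-in.)

#include <stdlib.h>

#define NCLASSES 64

static void *freelist[NCLASSES];          /* one list per 8-byte class      */

static void *slow_alloc(size_t c)         /* refill / big-request path      */
{
    return malloc(c * 8);                 /* stand-in for the real thing    */
}

void *fast_malloc(size_t n)
{
    size_t c = (n + 7) >> 3;              /* size class in doublewords      */
    void  *p;
    if (c == 0) c = 1;
    if (c < NCLASSES && (p = freelist[c]) != NULL) {
        freelist[c] = *(void **)p;        /* pop the head: a few cycles     */
        return p;
    }
    return slow_alloc(c);
}

void fast_free(void *p, size_t n)         /* size-aware free, to keep the   */
{                                         /* sketch short                   */
    size_t c = (n + 7) >> 3;
    if (p == NULL) return;
    if (c == 0) c = 1;
    if (c < NCLASSES) { *(void **)p = freelist[c];  freelist[c] = p; }
    else free(p);
}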


	Stefan
From: Peter Holzer
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40slnm$ngb@wsrcom.wsr.ac.at>
"Stefan Monnier" <··············@epfl.ch> writes:

>In article <··········@wsrcom.wsr.ac.at>,
>Peter Holzer <···@wsrdb.wsr.ac.at> wrote:
>] Doesn't have to take much time. It is probably sufficient to use the
>] top 3 or 4 return addresses
[...]

>First thing, unless your program really eats up much memory, this is
>likely to not save any page fault.

Two counter arguments:

1) There is no reason why there should be only one implementation of
malloc on a system. There could be several, tuned for different program
behaviours. In this case, programs which don't use the stack-tracing
malloc aren't hit by it.

2) In my experience, workstations do page. Ok, once I forgot to
configure swapspace on my PC (32MB, linux) and only noticed it after
a week, but a 48MB Sun we had for testing swapped quite heavily, and
my PC at the office (also 32MB, linux) also pages now and then.
This is for everyday sysadmin's use (lots of xterms, mail, netscape,
some perl scripts, gcc every now and then). If you get into scientific
computing memory consumption grows fast (I should ask Franz to
instrument his economic models with malloc-tracing. I think that would
be interesting).

>Anyway, the speed of malloc *is* important because of the impact it
>has on programmers. 
[...]
>Your 3 or 4 lookups (the first will be trivial but the
>others might be more involved since they depend on the frame size) plus
>hash-table access will probably mean that malloc will be at least 3
>times slower (and I consider 3 as very optimistic). It's definitely not
>negligible and it definitely had better improve noticeably the locality
>to be worth bothering !

I would have thought a factor of two to be realistic. Apart from that,
I agree fully. Malloc must be fast and using the call information must
gain something (speed or space) to be worth the bother. I am not sure
how the call information can be used (haven't read that paper yet)
and I suspect it might need tweaking for each application. I do think
that there is some potential there and that it should be implemented 
and tried.

	hp
--
   _  | Peter Holzer | Systemadministrator WSR | ···@wsr.ac.at
|_|_) |-------------------------------------------------------
| |   |  ...and it's finished!  It only has to be written.
__/   |         -- Karl Lehenbauer
From: Hans Boehm
Subject: Re: allocator and GC locality, and locality generally (was Re: cost of malloc)
Date: 
Message-ID: <40nuc9$1qp@news.parc.xerox.com>
"Stefan Monnier" <··············@epfl.ch> writes:

>In article <··········@miranda.gmrc.gecm.com>,
>Paul Johnson <···@gmrc.gecm.com> wrote:
>] How about looking at the call stack?  ...

>... how do you plan on making the thing fast enough? ...

This is all discussed in great detail in "Using Lifetime Predictors to Improve
Memory Allocation Performance", Barrett and Zorn, PLDI 93
(SIGPLAN Notices 28, 6, pp. 187-196).  There are even some measurements
there.

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-3107951026250001@192.0.2.1>
In article <··········@fido.asd.sgi.com>, ·······@engr.sgi.com (Mike
McDonald) wrote:

>   I should preface my comments with a statement stating I know next to
> nothing about modern allocators and GCs. The last time I studied any GC
> in detail was for a graduate compiler construction class at UIUC in
> 83/84. After I presented my report on Henry Baker's generational (I
> think) GC, the professor stated that GC had nothing to do with
> compilers. I wimped out and let it slide instead of challenging him.

If you are referring to my April 1978 CACM paper, it does _not_ describe
what people would now call a 'generational' GC, but a straight-forward copying
GC which is scheduled in peculiar way.

[There seems to be a lot of confusion about this.  Coplien's "Advanced
C++ Programming Styles and Idioms", Addison-Wesley, 1992, is also very
confused about this.  He describes my 1978 GC as a "mark-sweep" gc, which
characterization most GC people would disagree with.]

If anyone is interested, this paper is available online at

ftp://ftp.netcom.com/pub/hb/hbaker/RealTimeGC.html (also .ps.Z).

>   I too am in favor of GCs from a software engineering point of view.
> I've spent too much of my life already worrying about malloc/free in
> both my programs and in others.

What, you don't work for 'Purify'??  ;-)

According to Jon Bentley and Bjarne Stroustrup, rewriting malloc/free
is one of the most rewarding forms of entertainment for SW people.  ;-)

>   I don't think allocation order and size gets you very far. First,
> allocation order tells you nothing about usage order. The order of
> allocation is very dependent on the algorithms used and the style of
> the writer. Changing the order of a couple of C lines can drastically
> change the order of allocation.

Exactly!!  Which is why many 'studies' aren't worth the paper they are
printed on.

-----

My own pet peeve is _weak_ language type systems like Ada and C++ that
_force_ you to break apart a single object into a number of smaller subobjects
_that are separately mallocated_ even when you know how to manage the
object as one object in assembly language.  Stroustrup's C++ is especially
'pointer-happy', as a perusal of the following two books will show:

Stroustrup, Bjarne.  "The C++ Programming Language, Second Edition".  Addison
Wesley, 1991.

Ellis, M.A. and Stroustrup, B.  "The Annotated C++ Reference Manual".
Addison Wesley, 1990.

Whenever things get tough, C++ wimps out and forces another pointer indirection.

Of course, this 'solves' the problem for the C++ implementor, but throws
a much larger problem over the fence onto the poor application writer
and/or memory manager, who is now forced to deal with a much higher load
of smaller objects.

Whoever said that C++ doesn't force you to pay for features you don't use
hasn't ever programmed in C++ to any extent.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Scott Wheeler
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <jyvlsr@bmtech.demon.co.uk>
In Article <··········@fido.asd.sgi.com> Mike McDonald writes:
>In article <······@bmtech.demon.co.uk>, Scott Wheeler <······@bmtech.demon.co.uk> writes:
>
>
>|> class str {
>|>     int iCount;
>|>     char achData[256];
>|> };
>|>
>|> with the obvious problems in terms of fixed maximum length of the
>|> string.

BTW, note that I wasn't recommending this design, just providing it for 
illustration.

>
>  Yuch! In C, (I don't know how to do it in C++) you'd do something 
like:
>
>typedef struct string {
>  int length;
>  char contents[0];
>  } *String;
>
>String make_string(int n)
>{
>  String s;
>
>  s = (String) calloc(1, sizeof(struct string) + n);
>  if (s == NULL)
>    you_lose();
>  s->length = n;
>  return (s);
>}
>
>
>(If your C compiler doesn't like arrays of length zero, declare 
>contents as length 1 and subtract 1 from the sizeof(struct string).)
>

Of course I've come across this sort of kluge before in maintaining old 
Unix code, but to me it epitomises all the vices of which non-C 
programmers accuse us. Thoroughly 'orrible. Incidentally, it's not 
legal ANSI C or C++: a zero-length array isn't allowed, and you may only 
form a pointer one position past the end of an array, not index beyond it. 
Granted, it will usually work, apart from 
the pointer size problem someone else pointed out. Oh, and you may get 
alignment problems. In an equivalent C++ class, you would *probably* 
avoid clobbering the pointer to the vtable, because it's usually in 
front of the first data member, but no guarantees; especially if you do 
multiple inheritance from this class. 

>  This way, both the length and the data are allocated together in the 
>heap. And you haven't limited the max length artificially. (I 
>suspect most BASIC systems do this.)

Er, you *have* limited the length, but have deferred the decision to 
run-time. Not a big win. With the normal technique (pointer to separate 
character array), you can add to the string as much as you want, and 
rely on the class replacing the character array with a bigger one when 
necessary.
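
For reference, the normal technique comes out something like this in plain C
(a sketch; str_append is an invented name and error handling is minimal):

#include <stdlib.h>
#include <string.h>

struct str {
    size_t len, cap;
    char  *data;                          /* separate, resizable buffer     */
};

int str_append(struct str *s, const char *t, size_t n)
{
    if (s->len + n > s->cap) {            /* buffer full: replace it with   */
        size_t newcap = s->cap ? s->cap : 16;        /* a bigger one        */
        char  *d;
        while (newcap < s->len + n)
            newcap *= 2;
        d = realloc(s->data, newcap);
        if (d == NULL)
            return -1;
        s->data = d;
        s->cap  = newcap;
    }
    memcpy(s->data + s->len, t, n);       /* no fixed maximum length        */
    s->len += n;
    return 0;
}

The price is exactly the extra indirection being complained about; the benefit
is that the header never moves and the string can keep growing.
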
>
>|> I am afraid I am now unclear as to what you are saying can be more
>|> than 3x faster than the C++ means of handling strings in respect of
>|> allocate/deallocate. Can you sketch out your preferred solution, and
>|> perhaps we can work out some way of benchmarking it?
>|>
>|> Scott
>
>  A page fault or cache miss between accessing the length and the 
>contents cost one heck of a lot! I wouldn't be surprised that with a 
>lot of allocators, the miss happens often enough to cause the 3x 
>difference in speed overall.

Yes, but *is* there a 3x difference in practice? I really doubt it, 
because most C++ programs are reasonably memory efficient (compared to 
something like Lisp, which needs to carry a lot more run-time support) 
and are quite likely to sit mainly or entirely in core. Specify some 
realistic operating conditions and we can try it out.

Scott
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-0408950815320001@192.0.2.1>
In article <······@bmtech.demon.co.uk>, Scott Wheeler
<······@bmtech.demon.co.uk> wrote:

> >> In Article <·······················@192.0.2.1> Henry Baker writes:
> >> >All you've said, is that if you go along with Stroustrup/Coplien's
> >> >programming styles, you can get the job done, sort of.  However, I
> >> >would argue that following their advice can cost you 3X - 100X in
> >> >performance, because you drive the poor allocator/deallocator crazy.
> >> >Forcing a program to go 2 indirections to access array/string elements
> >> >because C++ has no way to store the length of the array/string along
> >> >with the array/string is probably the most irritatingly obvious
> >> >indication of the problem, but the problem is a good deal more
> >> >widespread than this.
> >>
> >> Surely you are not claiming that a Pascal/BASIC-type string arrangement
> >> in memory is going to run at least 3x faster than a typical C++ string?
> >> This strains credibility.
> >
> >If you allocate/deallocate these things very often, 3x may be 
> >conservative. You pay some overhead simply accessing the string on a 
> >machine with a longer cache line, because the string header will cause 
> >a cache fault, and proceed to bring in a lot of stuff that won't be 
> >used, followed by an indirection which causes another cache fault when 
> >you start gaining access to the characters themselves.
> >
> >The C++ beginner is sucked in because 1) strings are 0x00 terminated,
> >and therefore don't usually require that the programmer keep separate
> >track of a length, and 2) the actual length of the storage is kept in
> >a 'hidden' location managed by the allocator itself.  Thus, you get
> >another example of hidden overhead that you can't even access yourself
> >-- e.g., to get a rough idea of how big the string is before searching
> >for the 0x00.
> 
> I had assumed from your "2 indirections" that you were talking about 
> C++ strings, not C strings - the former are usually implemented with 
> length stored explicitly, not relying on null termination. Data 
> representation is something like
> 
> class str {
>     int iCount;
>     char *pachData;
> };
> 
> Of course this can mean that the length count and character array 
> pointer are stored separately from the array itself, which was what I 
> thought your complaint was. My reference to Pascal and BASIC was 
> because they are usually implemented similarly but with the count and 
> the character array stored contiguously, avoiding one level of 
> indirection and only allocating one block of memory. Data 
> representation is often something like
> 
> class str {
>     int iCount;
>     char achData[256];
                   ^^^   This is silly.  The usual C solution is:
      char achData[];  /* Whatever length you need */
> };
> 
> with the obvious problems in terms of fixed maximum length of the 
> string.

If you want to be able to increase the size of the string, then you will have
to reallocate the string and store a new pointer into the header.  But
anyone using such 'growable' strings would presumably be ready to pay
this penalty.  

I would contend that the vast majority of strings are not only fixed
length, but _constant_, and since there is no need to grow the string,
there is also no need to force a second indirection.  The problem with
C++ is that there is no sanctioned way to store the length field contiguously
with the characters themselves without a lot of casting to get around
the type system.  You basically have to go back to C mode and perhaps
even call malloc directly.  You lose all of the 'advantages' of the C++
type system and ctors/dtors.

> I am afraid I am now unclear as to what you are saying can be more 
> than 3x faster than the C++ means of handling strings in respect of 
> allocate/deallocate. Can you sketch out your preferred solution, and 
> perhaps we can work out some way of benchmarking it?

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Cyber Surfer
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <807569893snz@wildcard.demon.co.uk>
In article <·······················@192.0.2.1>
           ······@netcom.com "Henry Baker" writes:

> I would contend that the vast majority of strings are not only fixed
> length, but _constant_, and since there is no need to grow the string,
> there is also no need to force a second indirection.  The problem with

Agreed. That's certainly the case in my C/C++ code. That's the nature
of the apps I write, rather than the way I code.

> C++ is that there is no sanctioned way to store the length field contiguously
> with the characters themselves without a lot of casting to get around
> the type system.  You basically have to go back to C mode and perhaps
> even call malloc directly.  You lose all of the 'advantages' of the C++
> type system and ctors/dtors.

Yep. I hate that, but I can live with it. I have to live without other
things, things that I'd _much_ rather have. So it goes. I'm not alone.
-- 
<URL:http://www.demon.co.uk/community/index.html>
<URL:http://www.enrapture.com/cybes/namaste.html>
"You can never browse enough."
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <405k8h$emi@news.parc.xerox.com>
······@netcom.com (Henry Baker) writes:

>I would contend that the vast majority of strings are not only fixed
>length, but _constant_, and since there is no need to grow the string,
>there is also no need to force a second indirection.  The problem with
>C++ is that there is no sanctioned way to store the length field contiguously
>with the characters themselves without a lot of casting to get around
>the type system.  You basically have to go back to C mode and perhaps
>even call malloc directly.  You lose all of the 'advantages' of the C++
>type system and ctors/dtors.

It still seems to me that we are arguing essentially about the best way to
implement the wrong interface.

We seem to agree that most strings are (or should be) constant.  But
presumably they are not explicit constants in the program source.  So
they are computed somehow and then not modified.  Typically they are
computed as the concatenation of other strings, as the substring of
another string, or by some combination of these, starting from program
constants, integer-to-string conversions, etc.

This argues that we should be concerned with:

1. Immutable strings.
2. Concatenation time and space requirements.
3. Substring time and space requirements.
4. String access time requirements.

String access time is linear for all reasonable implementations.

Concatenation takes linear time for all the proposals discussed here.
That means it takes quadratic time to build up a string by repeated
concatenation.  It can take basically constant time per operation if we use
a more structured string representation in which a string is a tree of
concatenations.  In this case, the concatenation operation doesn't
have to copy its arguments; the result shares them.  There is a potentially
exponential difference between the space required by conventional
flat strings, and that required by trees.

Similarly, the substring operation can take either constant or logarithmic
time if we allow it to share storage with the original string.  This requires
at least one level of indirection.  Using the standard representation it
requires either linear time or a special substring class with one level of
indirection. 
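
To make the sharing concrete, here is a minimal sketch of such a
concatenation tree.  It is not the Cedar rope or cord implementation (those
also rebalance and flatten small pieces); it only shows why concatenation
need not copy its arguments:

#include <stdlib.h>

/* Illustrative rope node: either a flat leaf or a concatenation of two
   ropes.  (Sketch only; not the real rope/cord declarations.) */
typedef struct rope {
    size_t len;                   /* total length of this rope        */
    const char *leaf;             /* non-NULL for a flat leaf         */
    struct rope *left, *right;    /* children of a concatenation node */
} rope;

rope *rope_leaf(const char *s, size_t len)
{
    rope *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->len = len;
    r->leaf = s;
    r->left = r->right = NULL;
    return r;
}

/* Constant work: the result shares both arguments, no characters move. */
rope *rope_concat(rope *a, rope *b)
{
    rope *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->len = a->len + b->len;
    r->leaf = NULL;
    r->left = a;
    r->right = b;
    return r;
}

/* The ith character is found by walking down the tree: O(depth). */
char rope_index(const rope *r, size_t i)
{
    while (r->leaf == NULL) {
        if (i < r->left->len) {
            r = r->left;
        } else {
            i -= r->left->len;
            r = r->right;
        }
    }
    return r->leaf[i];
}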

None of this is new.  Cedar has used such strings ("ropes") for 10 years.
A C based implementation ("cords") is included with our garbage collector.

So we're ignoring a potentially exponential complexity difference, and
we're concentrating on the factor of 2 in the asymptotically inferior
solution?  Why?

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-0708951241390001@192.0.2.1>
In article <··········@news.parc.xerox.com>, ·····@parc.xerox.com (Hans
Boehm) wrote:

> ······@netcom.com (Henry Baker) writes:
> So we're ignoring a potentially exponential complexity difference, and
> we're concentrating on the factor of 2 in the asymptotically inferior
> solution?  Why?

Because for most small strings, your O() arguments are completely washed
out by constant factors.  Most of the concatenation I have seen done
in C programs is done by either sprintf or strcat, and these strings are
typically quite small.  Therefore, the allocation/deallocation overhead
swamps the differences in complexity that you mention.

There are programs that waste a lot of time doing string manipulation.  I
would imagine that most of the Perl-like stuff is quite non-optimal.  If
people aren't using better representations for these kinds of strings,
it's probably because implementing the algorithm as a string-processing
algorithm is not such a hot idea in the first place.  In other words,
someone would be better off changing the algorithm completely, rather
than worrying about some non-optimal string representation.

A good example is symbolic algebra.  You can get along for a while
with a representation that does no sharing at all.  But the hideous
complexity of most symbolic algebra algorithms forces most systems to
utilize better representations.  I'm not sure that a 'string' representation
would ever be competitive with a more sophisticated representation for
symbolic algebra.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Stefan Monnier
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <4072u2$lqa@info.epfl.ch>
In article <·······················@192.0.2.1>,
Henry Baker <······@netcom.com> wrote:
] Because for most small strings, your O() arguments are completely washed
] out by constant factors.  Most of the concatenation I have seen done
] in C programs is done by either sprintf or strcat, and these strings are
] typically quite small.  Therefore, the allocation/deallocation overhead
] swamps the differences in complexity that you mention.

By the way, is there any paper available discussing the distribution of string
lengths in some programs?


	Stefan

PS: I feel like there are really (at least) two different kinds of strings, depending
on their use. I don't consider a "token" string from a lexer the same as the
text data of a text widget.
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <40apft$3im@news.parc.xerox.com>
······@netcom.com (Henry Baker) writes:

>In article <··········@news.parc.xerox.com>, ·····@parc.xerox.com (Hans
>Boehm) wrote:

>> ······@netcom.com (Henry Baker) writes:
>> So we're ignoring a potentially exponential complexity difference, and
>> we're concentrating on the factor of 2 in the asymptotically inferior
>> solution?  Why?

>Because for most small strings, your O() arguments are completely washed
>out by constant factors.  Most of the concatenation I have seen done
>in C programs is done by either sprintf or strcat, and these strings are
>typically quite small.  Therefore, the allocation/deallocation overhead
>swamps the differences in complexity that you mention.

The breakeven points for string concatenation are actually very small.
My measurements suggest it's at about 10 characters.  You do lose a
small constant factor (on the order of 2 with a fast implementation,
a bit more if you use the simplest C interface) in traversal time.
(See Boehm, Atkinson, Plass, "Ropes Are Better Than Strings",
PARC CSL-94-10 for details.  A revised version should appear in
SP&E shortly.)

For most programs I would greatly prefer a version that's a little
slower on typical input, but uses asymptotically good algorithms.
Such a program is much more robust.  I can expect it to keep working
reasonably if I feed it input that wasn't used for testing by the
program designer.

The program that's tuned for short strings with asymptotically
suboptimal algorithms will usually work fine.  But then it may start
taking hours instead of seconds when I increase the size of some input
that should have minimal effect on its running time.  For example,
I recently spent several hours tracing dismal mouse selection
response under rxvt to the size of the history buffer.  Apparently,
it was meant to be "small", and I set it at 5000 lines, like I always
do.  I haven't looked at the code, so I still don't understand why
that should matter.

>There are programs that waste a lot of time doing string manipulation.  I
>would imagine that most of the Perl-like stuff is quite non-optimal.  If
>people aren't using better representations for these kinds of strings,
>it's probably because implementing the algorithm as a string-processing
>algorithm is not such a hot idea in the first place.  In other words,
>someone would be better off changing the algorithm completely, rather
>than worrying about some non-optimal string representation.

I don't think you should use string manipulation where it's inappropriate.
But many programs (editors, mailers, news readers, etc.) naturally
manipulate long strings.  Currently they mostly invent their own custom
representations, except where the authors decided not to bother.  In those
places they effectively break on unexpectedly long input.  Why not use
a robust representation as the standard, and then optimize in those
few places where it doesn't perform adequately, and it's safe to
optimize?

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Robert P. Krajewski
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <rpk-1108950023020001@192.0.2.1>
In article <···················@slsvhdt.lts.sel.alcatel.de>,
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

>The problem is, that the operations which need to be optimized are not
>the same for different applications.  A compiler will need different
>string handling than an editor, for example.
>
>With this in mind, I'm not really convinced that the solution is to
>try and create an optimal string class as a standard.  I rather think
>of the standard string class as a facility for people like myself,
>whose programs only do string handling secondarily (formatting error
>messages, and the like).  If I were writing an editor, for example, I
>would not expect the standard string class to be acceptable for my
>text buffers.  In this regard, just about any string class with the
>required functionality will do the trick.  (And it is more important
>that the string class be easier to use than that it be fast.)
>
>This does mean that most text oriented applications will have to
>`re-invent the wheel', in that they will have to write their own
>string class.  But I'm not convinced that there is a string class
>which would be appropriate for all applications, anyway.

OK, but what might be appropriate is that there's a standard string class
*interface*, so that even implementors of specialized string classes can
use standardized routines for the things they need, but don't need to be
optimized. Such a standard also allows them to pass such objects to
external libraries that expect a standard interface for strings.
Otherwise, there's going to be a lot of reimplementation going on. (As if
there isn't already.)

I designed a string class that was actually two string classes. The first
class was SimpleString. All it assumed was the existence of a pointer to
characters, C style (this is C++), and a length count. Even with this
constraint, which rules out the possibility of growing a string or
allocating storage later in its lifetime, the SimpleString class could do
a lot of useful work -- counting spacing characters, searching, copying,
conversion to various character sets, and even in-place modifications that
don't require reallocation in any case, such as case (lower/upper)
conversion.

By breaking the assumption of a certain kind of allocation, it was
possible to derive from this class to take care of differing situations.
First, there was the general string class that allocated character storage
from the heap and pretty much rounded out the set of the operations that
were destructive or involved side-effects. But you could also derive
classes that always expected that the storage would be managed for them,
especially in the very common case where a string's length would never
increase once it was created -- the character storage might come from the
stack, or might have been cleverly arranged to come from the same heap
block as the heap allocation of the string object itself (i.e., allocate a
block the size of the string descriptor, plus the storage needed for
characters). So even "clever" strings could be passed to the simpler, more
general operators that didn't care about cleverness.
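
Roughly the same split can be sketched in C (names are illustrative, not the
original C++ classes).  The read-only layer assumes nothing beyond a pointer
and a length; one constructor happens to co-allocate the descriptor and the
characters in a single heap block, but the read-only operations never depend
on that:

#include <stdlib.h>
#include <string.h>

/* "Pointer plus length" descriptor; the characters may live on the stack,
   in static storage, or in the same malloc block as the descriptor. */
typedef struct {
    size_t len;
    char  *chars;
} simple_string;

/* One variant: descriptor and characters carved from a single heap block. */
simple_string *ss_from_heap(const char *src, size_t len)
{
    simple_string *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->chars = (char *)(s + 1);
    s->len = len;
    memcpy(s->chars, src, len);
    s->chars[len] = '\0';
    return s;
}

/* Works on any simple_string, however its storage was arranged. */
size_t ss_count_spaces(const simple_string *s)
{
    size_t i, n = 0;
    for (i = 0; i < s->len; i++)
        if (s->chars[i] == ' ')
            n++;
    return n;
}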
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <40o1dr$3ff@news.parc.xerox.com>
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) writes:

...

>With this in mind, I'm not really convinced that the solution is to
>try and create an optimal string class as a standard.  I rather think
>of the standard string class as a facility for people like myself,
>whose programs only do string handling secondarily (formatting error
>messages, and the like).  If I were writing an editor, for example, I
>would not expect the standard string class to be acceptable for my
>text buffers.  In this regard, just about any string class with the
>required functionality will do the trick.  (And it is more important
>that the string class be easier to use than that it be fast.)

>This does mean that most text oriented applications will have to
>`re-invent the wheel', in that they will have to write their own
>string class.  But I'm not convinced that there is a string class
>which would be appropriate for all applications, anyway.

I agree that some applications will need their own string class.
Nonetheless, I think a standard string class is important.  Many libraries
will require strings as input parameters, or generate strings.  If you
want any hope of having such libraries play together, they will need
a standard string representation.

I claim the characteristics you want from a standard string class are:

1) It should be easy to use, with a clean interface.  It should be
general enough to allow reusability of libraries that rely on it.

2) It should be robust.  The implementation should scale reasonably
to large inputs, both in that it remains correct, and that its performance
remains reasonable.

I would argue that a string class based on a conventional flat representation,
and whose interface exposes that, doesn't do very well on either criterion.
It's often difficult or impossible to pass some other string representation
(e.g. a file) to a string function without actually changing representation
(e.g. reading the whole file).  A tree-based representation can, at minimal
cost, accommodate strings with user-specified character access functions.

A flat representation does allow in-place string updates,
a facility I would personally rather leave out of a "standard" string class.
You still have arrays of characters when you need them.

The standard string representations don't scale to long
strings.  Code that took 10 milliseconds on 100 character strings, may
take more than a minute on 10000 character strings.  Programs are unlikely
to be usable for string inputs much longer than what the author
directly anticipated with a custom string class, even if machines
have gotten appreciably faster in the interim.  This sort of behaviour
is exhibited by any program that builds a long string by repeated
concatenation, or repeatedly inserts characters in the middle of a long
string.
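
The quadratic case is easy to reproduce; the usual flat-string idiom below
rescans the whole destination on every append (illustrative only):

#include <string.h>

/* Builds a string from n pieces with a flat representation.  Each strcat
   walks everything appended so far to find the terminating NUL, so the
   total work is O(n^2) in the final length.  The caller must supply a
   destination buffer large enough for all the pieces combined. */
void build_by_strcat(char *dst, const char *pieces[], int n)
{
    int i;
    dst[0] = '\0';
    for (i = 0; i < n; i++)
        strcat(dst, pieces[i]);
}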

(Another, more subtle, problem with flat string representations is
that they don't interact very well with storage allocators,
especially again as strings get long.  Fragmentation is much less likely
with many short string chunks than with long contiguous strings.
Even copying garbage collectors may suffer, since they will (hopefully)
avoid moving large pointerfree objects.)

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Ed Osinski
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <40o99k$r5@cmcl2.NYU.EDU>
In article <··········@news.parc.xerox.com>, ·····@parc.xerox.com (Hans Boehm) writes:
|> (Another, more subtle, problem with flat string representations is
|> that they don't interact very well with storage allocators,
|> especially again as strings get long.  Fragmentation is much less likely
|> with many short string chunks than with long contiguous strings.
|> Even copying garbage collectors may suffer, since they will (hopefully)
|> avoid moving large pointerfree objects.)

Aha!  So this can then be used as a 'reason' why garbage collection is
not a good idea.  :-)

---------------------------------------------------------------------
 Ed Osinski                  
 Computer Science Department, New York University         
 E-mail:  ·······@cs.nyu.edu 
---------------------------------------------------------------------
"No, no, no, don't tug on that. You never know what it might be attached to."
	-- Buckaroo Bonzai to assistant during brain surgery
From: Stefan Monnier
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <40ph96$gnt@info.epfl.ch>
In article <·········@cmcl2.NYU.EDU>, Ed Osinski <·······@cs.nyu.edu> wrote:
] In article <··········@news.parc.xerox.com>, ·····@parc.xerox.com (Hans Boehm) writes:
] |> (Another, more subtle, problem with flat string representations is
] |> that they don't interact very well with storage allocators,
] |> especially again as strings get long.  Fragmentation is much less likely
] |> with many short string chunks than with long contiguous strings.
] |> Even copying garbage collectors may suffer, since they will (hopefully)
] |> avoid moving large pointerfree objects.)
] Aha!  So this can then be used as a 'reason' why garbage collection is
] not a good idea.  :-)

I see the smiley, but still: he didn't say that flat strings interact poorly
with GC, but that they interact poorly with storage allocators and that even
copying GCs (which could avoid that bad interaction thanks to their compacting
behavior) don't help since they would often treat such big pointer-free objects
in a special way (they wouldn't copy them).

So what he says is that dynamically allocated flat strings are bad because of
their potential for generating fragmentation, even in the most favorable case
(a copying GC).

This doesn't look like a reason why GC is not a good idea!


	Stefan
From: Erik Corry
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <DDBsKu.H5B@kroete2.freinet.de>
Hans Boehm (·····@parc.xerox.com) wrote:

: I claim the characteristics you want from a standard string class are:

: 1) It should be easy to use, with a clean interface.  It should be
: general enough to allow reusability of libraries that rely on it.

: 2) It should be robust.  The implementation should scale reasonably
: to large inputs, both in that it remains correct, and that its performance
: remains reasonable.

3) You should be able to get a C-style null-terminated char array
out of it at minimal cost, because you are going to need this to deal
with just about every library written in C up until your great new
implementation came along.

This seems to me to be a major argument for having a flat representation
internally, too. If you don't have simple extraction of/conversion to
a C-style null-terminated character array, then your new string library
is not going to be used.

--
New Age: Science meets a modern Hydra
--
Erik Corry ·······@inet.uni-c.dk
From: Joseph H Allen
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <DDE1Gx.IxD@world.std.com>
In article <··········@kroete2.freinet.de>,
Erik Corry <····@kroete2.freinet.de> wrote:
>Hans Boehm (·····@parc.xerox.com) wrote:
>
>: I claim the characteristics you want from a standard string class are:
>
>: 1) It should be easy to use, with a clean interface.  It should be
>: general enough to allow reusability of libraries that rely on it.

>: 2) It should be robust.  The implementation should scale reasonably
>: to large inputs, both in that it remains correct, and that its performance
>: remains reasonable.

>3) You should be able to get a C-style null-terminated char array
>out of it at minimal cost, because you are going to need this to deal
>with just about every library written in C up until your great new
>implementation came along.

>This seems to me to be a major argument for having a flat representation
>internally, too. If you don't have simple extraction of/conversion to
>a C-style null-terminated character array, then your new string library
>is not going to be used.

You may want to look at the automatic string library in my editor joe (get
by anonymous ftp from ftp.worcester.com).  This library does nothing
interesting with memory allocation, but it satisfies '3' above nicely. 
Unfortunately you probably need indirection for any interesting GC.  Anyway
here's the man page for it.  There may be some ideas you can get for making
a more compatible string class.  There's also an automatic array of strings
library for things like environment variables and file name lists.

Name
	vsncpy, vsrm, sLEN, vstrunc, vsensure, vsins, vsdel, vsfill, vsset,
vsadd - Automatic string management functions

Syntax
	#include <vs.h>

	char *vsncpy(char *d,int off,char *s,int len);
	void vsrm(char *d);
	int sLEN(char *d);
	char *vstrunc(char *d,int len);
	char *vsensure(char *d,int len);
	char *vsins(char *d,int off,int len);
	char *vsdel(char *d,int off,int len);
	char *vsfill(char *d,int off,int c,int len);
	char *vsadd(char *d,int c);
	char *vsset(char *d,int off,int c);

Description
	This is a string library which supports strings which automatically
resize themselves when needed.  The strings know their own size, so getting
the length of a string is always a fast operation and storing NULs in the
string is permissible.  The strings are backward compatible with C's regular
zero terminated strings.

	Each automatic string is stored in its own malloc block and has the
following format:

	<bksize><length><string><zero>

	'bksize' and 'length' are integers which give the size of the malloc
block and the length of the string.  A zero character always follows the
string for compatibility with normal C zero-terminated strings.  The zero is
not counted as part of the string length.

	The strings are not addressed with 'bksize' (the beginning of the
malloc block).  Instead, they are addressed at the first actual character of
the string itself.  This means that an automatic string looks like a normal
C string and can be addressed with type 'char *'.  Also the array access
operator '[]' works for reading and overwriting automatic strings and
automatic strings can be passed directly to UNIX operating system functions. 
However, free() can not be used to dispose of automatic strings.  Instead,
vsrm() must be used.  Also an automatic string plus an offset is not an
automatic string, but is still a legal C language string.
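
One way to picture that layout (this is just a reading of the description
above, not the actual joe source):

#include <stddef.h>

/* <bksize><length><string><zero>, with the string addressed at its first
   character.  Header fields are reached by stepping backwards from the
   char pointer; the pointer handed around is &hdr->text, not the start of
   the malloc block, which is why free() must not be used on it directly. */
struct vs_header {
    int  bksize;                 /* size of the enclosing malloc block */
    int  length;                 /* length of the string proper        */
    char text[1];                /* characters, then a terminating 0   */
};

#define VS_HDR(s) ((struct vs_header *) \
                   ((char *)(s) - offsetof(struct vs_header, text)))
#define VS_LEN(s) (VS_HDR(s)->length)    /* why getting the length is fast */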

Primary function
	_vsncpy_ - Copy a block of characters at address 's' of length 'len'
onto the automatic string 'd' at offset 'off'.  The automatic string is
expanded to handle any values of 'len' and 'off' which might be given.  If
'off' is greater than the length of the string, SPACEs are placed in the
gap.  If 'd' is NULL, a new string is created.  If 'len' is 0, no copying or
string expansion occurs.  _vsncpy_ returns the automatic string, which may
have been realloced or newly created in its operation.

	_vsncpy_ is the most important automatic string function.  It is
both the primary constructor of automatic strings and is also a useful
operator.  It works in close conjunction with the following macros:

	sc("Hello")	Gives --> "Hello",sizeof("Hello")-1
	sz(s)		Gives --> s,zlen(s)
	sv(d)		Gives --> d,sLEN(d)

	These macros are used to build arguments for _vsncpy_.  Many
functions can be created with combinations of sc/sz/sv and vsncpy:

	s=vsncpy(NULL,0,NULL,0);	Create an empty automatic string

	s=vsncpy(NULL,0,sc("Hello"));	Create an automatic string
					initialized with the string "Hello"

	d=vsncpy(NULL,0,sv(s));		Duplicate an automatic string

	d=vsncpy(NULL,0,sz(s));		Convert a C string into an automatic
					string

	d=vsncpy(sv(d),sv(s));		Append automatic string s onto d

	d=vsncpy(sv(d),sc(".c"));	Append a ".c" extension to d.

	d=vsncpy(d,0,sc("Hello"));	Copy "Hello" to the beginning of d. 
					The original length of d is
					unchanged, unless it had to be
					expanded to fit "Hello".

Other functions

	_vsrm_ is used to free an automatic string.  If NULL is passed to
it, nothing happens.

	_sLEN_ returns the length of an automatic string.  If the string is
NULL, sLEN returns 0.

	_vstrunc_ sets the length of an automatic string.  The string is
created if NULL is passed to it.  The string will be padded with spaces if
its length is increased.  Vstrunc may reallocate the string if (and only if)
it is expanded, so the return value must not be ignored.

	_vsensure_ reallocs the malloc block of the given string so that the
string can be later expanded to the specified length without any calls to
realloc.

	_vsins_ inserts a gap into a string.  If the string is NULL it is
created.  If the specified offset is past the end of the string, the string
is extended.

	_vsdel_ deletes a section of the string.  It does nothing if the
specified offset is past the end of the string.

	_vsfill_ fills a portion of a string to the specified character.

	_vsadd_ appends a single character to the end of the string.  A new
string is created if the specified string was NULL.  This function is very
useful for loops which create strings by appending some source of
characters.

	_vsset_ sets a character at a specified offset.  A new string is
created if the specified string was NULL.  The string is filled with SPACEs
if the specified offset is past the end of the string.
-- 
/*  ·······@world.std.com (192.74.137.5) */               /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <40td0e$c0o@news.parc.xerox.com>
····@kroete2.freinet.de (Erik Corry) writes:

>Hans Boehm (·····@parc.xerox.com) wrote:

>: I claim the characteristics you want from a standard string class are:

>: 1) It should be easy to use, with a clean interface.  It should be
>: general enough to allow reusability of libraries that rely on it.

>: 2) It should be robust.  The implementation should scale reasonably
>: to large inputs, both in that it remains correct, and that its performance
>: remains reasonable.

>3) You should be able to get a C-style null-terminated char array
>out of it at minimal cost, because you are going to need this to deal
>with just about every library written in C up until your great new
>implementation came along.

>This seems to me to be a major argument for having a flat representation
>internally, too. If you don't have simple extraction of/conversion to
>a C-style null-terminated character array, then your new string library
>is not going to be used.

Granted, there's a minor loss here with a more structured
representation.  We can arrange (and have arranged) for
the conversion to "const char *" to be the identity function (and hence fast)
on most short strings.  But, in the general case, you have to copy the string.
(If the C routine may modify the string, then many of the proposals under
discussion here would have to copy the string.)

But if we assume that the C library function looks at each character at least
once (the usual case, I believe), then this still slows down the program by
at most a small constant factor.  This still seems like a win over the
frequently quadratic and potentially exponential slowdown and space
increase you get from the flat representation.


Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <413cvb$h85@news.parc.xerox.com>
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) writes:

>I don't see any possibility of passing a file to a ``standard'' string
>class without reading it anyway.  Remember, you do not have access to
>the internals of a standard string class to let it know about
>alternative representations, like a file.  So if it isn't provided for
>in the standard, you don't have it.

Structured strings will have to be pointers to one of at least two
different objects anyway (flat leaves and internal concatenation nodes).
That makes it inexpensive to add a third variant consisting of a length
and a user defined function (or object) for producing the ith character
of a string.  Both Cedar ropes and the cord package contain a way to
convert a file to a string such that it is read lazily on accesses, based
on such a facility.  In the case of cords, the conversion functions are
provided by the package itself, but they rely on none of the internals.
They could equally well have been written by a user who didn't have access
to the source.
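
Schematically, the third variant needs nothing more than a length and a
character-producing function; something like this (illustrative types only,
not the actual cord or Cedar declarations):

#include <stddef.h>

/* A string variant backed by a user-supplied function.  The client data
   might hold an open file plus a small cache of the last block read. */
typedef char (*char_at_fn)(void *client_data, size_t i);

typedef struct {
    size_t      len;
    char_at_fn  get;
    void       *client_data;
} fn_string;

char fn_string_index(const fn_string *s, size_t i)
{
    return s->get(s->client_data, i);    /* may read the file lazily */
}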

Of course, random access to the resulting string can be expensive.  But the
function providing file access can and does cache, so it's reasonable for many
uses.  The toy editor supplied with the package has been used to edit files
tens of megabytes long in a few hundred KB heap using this facility.  The
editor still manipulates the 30MB file using string concatenation
and substring operations.  And it's not expensive to insert a character in
the middle.  If you know the insertion point, no reading of the file
is required, except to display the result (and that section is likely to be
cached). Searches that fail take a while, because they do require
reading the whole file.

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Scott Wheeler
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <jywdfb@bmtech.demon.co.uk>
In Article <··········@news.parc.xerox.com> Hans Boehm writes:
>It still seems to me that we are arguing essentially about the best way to
>implement the wrong interface.
>
>We seem to agree that most strings are (or should be) constant.  But
>presumably they are not explicit constants in the program source.  So
>they are computed somehow and then not modified.  Typically they are
>computed as the concatenation of other strings, as the substring of
>another string, or by some combination of these, starting from program
>constants, integer-to-string conversions, etc.
>
>This argues that we should be concerned with:
>
>1. Immutable strings.
>2. Concatenation time and space requirements.
>3. Substring time and space requirements.
>4. String access time requirements.
>...
>So we're ignoring a potentially exponential complexity difference, and
>we're concentrating on the factor of 2 in the asymptotically inferior
>solution?  Why?

A good point. Speaking only for myself, I usually work with programs 
where the number of live strings at any given time is low (hence the 
allocator locality problem is unlikely to cause thrashing). When 
thinking of an example where a large number of strings would be live at 
a given time, the symbol table of a compiler was the most obvious 
example, and the compilers I have seen have done little or no alteration to
symbols after lexing and parsing. I'm not sure what string operations 
would dominate in lexing - possibly search for substrings?

I'm afraid the tree implementation is new to me, and I'll try to read 
up on it. Presumably if the tree representation has an O(n^2) or O(ln 
n) advantage over the doubly indirect method under discussion, there is 
also a constant factor to consider, as I assume the logic to implement 
the tree is more complex. At roughly how many string operations would 
the tree method be expected to be superior?

Scott
From: Hans Boehm
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <3vtrro$2is@news.parc.xerox.com>
······@netcom.com (Henry Baker) writes:

[previous discussion about C++ requiring string headers and indirection elided]

>All you've said, is that if you go along with Stroustrup/Coplien's
>programming styles, you can get the job done, sort of.  However, I would
>argue that following their advice can cost you 3X - 100X in performance,
>because you drive the poor allocator/deallocator crazy.

>Forcing a program to go 2 indirections to access array/string elements because
>C++ has no way to store the length of the array/string along with the
>array/string is probably the most irritatingly obvious indication of the
>problem, but the problem is a good deal more widespread than this.

This discussion isn't very clear to me.  I doubt that the extra level of
indirection for strings normally costs you much on accesses.  Hopefully
most of the character accesses occur within library functions which
perform the indirection once and then look at many characters.  Even
if this is not the case, a clever compiler could usually amortize the
dereference over many character accesses.

Granted, there is some additional allocation cost involved.  But this at
most doubles the cost involved relative to keeping a length and the characters
in a single memory object.  Aside from cases in which this just pushes you
over the paging threshold, I don't see where the 3X to 100X slowdown
should come from.

There are other places in which commonly accepted C++ methodology can easily
result in such slowdowns.  Performing deep copies in order to maintain
unique references to heap objects can have horrible consequences.  You can
easily change a program with linear running time into one with exponential
time and space requirements.  The use of flat string representations,
as opposed to tree-based representations, can turn a program with linear
running time into one with quadratic running time.  But all of the
string representations discussed here are flat (and in my opinion poor
choices for a general-purpose string package).

Are we really talking about these other costs?  They are much more
related to the absence of garbage collection than they are to the type
system.

Note that there are other cases in which introducing additional objects
and indirection levels can save space, especially with garbage collection.
See ftp://parcftp.xerox.com/pub/gc/papers/pldi93.ps.Z for details.

Should this discussion be moved to comp.lang.c++?

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-3107951007190001@192.0.2.1>
In article <··········@jive.cs.utexas.edu>, ······@cs.utexas.edu (Paul
Wilson) wrote:

> Then I should restate a couple of things:  I'm very much pro-GC, for software
> engineering reasons, but I think that it ultimately costs you a little
> bit in locality, relative to a similarly good explicit deallocator
> (malloc/free kind of thing).

I think that you'd have to be very specific about what your assumptions
were in order to back this conclusion up.  You'd have to include things
like how smart the compiler was about killing dead things promptly,
and what language you are writing in, and what help the language/type
system gives to the programmer to do 'explicit' deallocation.

For a large fraction of 'vanilla' C/C++ code, however, I would strongly
disagree.  Most people's attempts at local 'optimizations' by doing
explicit deallocation will backfire, and lead to _both_ insecure code
_and_ lower performance.  In most professionally constructed GC systems,
the GC architect has thought a _lot more_ about locality and performance
issues than most application writers, and is also in a much better position
to do something positive about locality than most application writers.

> Unfortunately, comparisons are quite difficult because the state of
> allocator research and development is a bit of a mess.  We don't know
> much more than we did in 1971, which wasn't a whole lot.

Agreed.

> >Paul Wilson <······@cs.utexas.edu> wrote:
> >] Ken Dickey <····@apple.com> wrote:
> >] >At 11:38 AM 95/07/26 -0400, Scott McKay wrote:
> >] >>Another thing to note is that compacting GC's usually dramatically
> >] >>reduce ther overall working set of programs, which in turn reduces the
> >] >>number of page faults a program might take.  ...
> >
> >Is there hard data on this one ?
> 
> Essentially none.  Compaction is overrated, by the way.  It can do as much
> damage as good, if it's not done *right*.  As far as comparisons between
> conventional and GC'd systems go, there is essentially zero data on locality
> effects, and there's every reason to think that locality effects are 
> important.

With a high quality VM system and log-structured file systems, I'm afraid
you are right.  In these cases, the underlying VM has already gained most
of the advantage that compaction would offer.  One could gain a bit more
memory utilization with compaction, but the cost of actually doing the
compaction may not pay for itself.  With modern caches, I'm not sure if
there is any gain at all on that front.

IMHO the major gain from compaction is to defragment the address space,
but as Paul's recent work has shown, one may not have to do this very
often.

> For more info on this, see the paper allocsrv.ps in our repository of papers
> on memory management.  (ftp.cs.utexas.edu, in the directory pub/garbage.)

Excellent paper, and highly recommended.

> GC's (except for reference
> counting GC's) tend to do something that's intrinsically nasty for locality;
> they pollute the cache with short-lived objects and can't *reuse* that memory
> until a garbage collection detects that the memory is reusable.  You can limit
> the damage this causes, but you can't eliminate it entirely.

This is correct as far as the _address space_ is concerned, but completely
wrong as far as cache behavior is concerned.  Reinhold's wonderful thesis

Reinhold, Mark B.  "Cache Performance of Garbage-Collected Programming
Languages".  MIT/LCS/TR-581, Sept. 1993.  (Also PLDI'94??)

shows how a gc makes _excellent_ use out of a write-allocate cache, and
that short-lived objects cause no problems at all, because _they live
and die in the cache and need never cause any memory traffic at all_.

So the problem is bad HW architectures, not GC, per se.  (When, oh when,
will VM architects give us 'write-allocate' VM?  CS people have known
that it is important for 30 years, but somehow the OS people have never
gotten the word.)

> Yes.  On the other hand, I think one of the things that has limited progress
> is a failure to focus on the fundamentals---namely program behavior and
> what an allocator (or GC) can do to map that behavior nicely onto memory.
> Architects tend to have a very naive view of locality, which is that it's
> like a natural phenomenon that either is or isn't there.  In fact, it's a
> product of patterned program behavior interacting with patterned allocator
> (or GC) strategies, and these must be studied in relative isolation to
> figure out what's really going on.  I believe that there are common and
> simple patterns in program behavior that have not been noticed or exploited,
> and that this is why there's been very little progress in the study of
> allocators and the study of locality.

Excellent point.  Unfortunately, HW architects don't have a lot of
control over what SW people run on their machines.  It would be a
shame if HW architects go to a lot of trouble, however, to give people
something that can be much more effectively accomplished with better
SW.

> >This is the "understanding" problem: if the cace can notice that reading the
> >line is not necessary when allocating a new object, it means the cache
> >understands better the way the GC-allocator works. Your point about the
> >memory bandwidth of the write-backs could be argued against in saying that
> >there should be a way for the cache to notice that most of the write-backs
> >aren't necessary (I'm assuming here that the generations' sizes have been
> >carefully chosen so that they "fit the memory hierarchy" (whatever this
> >really means in terms of relative sizes): in such a case most write-backs
> >will be for "old-space" objects that have been recognized as dead already).
> >Since there will still be write-backs of objects that *are* dead but haven't
> >been recognized as such yet, I guess manual allocators still have an edge.

Manual allocators have exactly the same trouble that GC's do if they
have no way to tell the HW that a particular cache line is now dead.
It's way past time for HW to have some ability for the SW to inform it
that things are dead.  I have argued elsewhere for things like a
"last load" instruction, which informs that HW that after supplying
the contents of this memory location, it is now dead, and need not
be written back.  In conjunction with a write-allocate cache, much
memory traffic can be eliminated by the compiler.

> I don't think so.  For small caches, by the time you've compacted data it's
> too late---you touched too much memory in between GC's, because you couldn't
> reuse memory immediately.

Wrong intuition.  Worry more about cache behavior.  See Reinhold's thesis again.

> >For lack of hard data, this is probably blowing hot air, tho !
> >But the point still holds: it'd probably be a good idea to provide ways to
> >tell the memory hierarchy what's going on (like a "trash" instruction to
> >indicate that some memory region can be trashed (to free the old-space) or a
> >destructive read (can be used for linear-programming, or for stack-pops)).
                                     ^^^^^^^^^^^^^^^^^^

I think you mean 'linear logic style of programming' here!

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From:  Robin Popplestone
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <3vtvi5$g8a@kernighan.cs.umass.edu>
> In article <··········@info.epfl.ch>,
> Stefan Monnier <··············@epfl.ch> wrote:
>In article <·······················@192.0.2.1>,
>Henry Baker <······@netcom.com> wrote:
>] IMHO the major gain from compaction is to defragment the address space,
>] but as Paul's recent work has shown, one may not have to do this very
>] often.

And Paul Wilson writes in <··········@jive.cs.utexas.edu>

> Right. For "most" programs (that have been studied in enough detail, that
> is), it looks like you never have to do it at all.  Nobody's studied
> really long-running programs, though, and it's not clear whether
> fragmentation will accumulate over time, or stabilize at an acceptable
> level.

The longest running,  biggest mix, garbage  collected systems are  probably
those in which the operating system  itself is subject to the same  garbage
collector as  user-programs.  Experience  with  the  Multipop  time-sharing
system over  2 decades  ago showed  that compaction  was necessary  in  the
circumstances pertaining then, at  least in a  small machine (384KB).  Both
data and code-blocks were garbage collected, and the job-mix included
robotic programs that had largish digitised images as well as classic
list processing such as the (original) Boyer-Moore theorem prover.
The garbage collector operated in-situ (no virtual memory!), but had a
compaction pass which was conditionally invoked. Without compaction, the
system would rapidly become unusable and have to be rebooted. With
compaction it would run for a (small) number of days.

How this experience would scale to a modern environment is not obvious.

Robin Popplestone.
From: Paul Wilson
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <400i6p$865@jive.cs.utexas.edu>
In article <··········@kernighan.cs.umass.edu>,
Robin Popplestone  <···@cs.umass.edu> wrote:
>Paul Wilson writes in <··········@jive.cs.utexas.edu>
>
>> Right. For "most" programs (that have been studied in enough detail, that
>> is), it looks like you never have to do it at all.  Nobody's studied
>> really long-running programs, though, and it's not clear whether
>> fragmentation will accumulate over time, or stabilize at an acceptable
>> level.
>
>The longest running,  biggest mix, garbage  collected systems are  probably
>those in which the operating system  itself is subject to the same  garbage
>collector as  user-programs.  Experience  with  the  Multipop  time-sharing
>system over  2 decades  ago showed  that compaction  was necessary  in  the
>circumstances pertaining then, at  least in a  small machine (384KB).  Both
>data and code-blocks were garbage collected, and the job-mix included
>robotic programs that had largish digitised images as well as classic
>list processing such as the (original) Boyer-Moore theorem prover.
>The garbage collector operated in-situ (no virtual memory!), but had a
>compaction pass which was conditionally invoked. Without compaction, the
>system would rapidly become unusable and have to be rebooted. With
>compaction it would run for a (small) number of days.

I find this pretty interesting.  I take it this was a unified heap,
like in a Lisp machine system, where you don't have the usual process
memory boundaries, but unlike a Lisp machine in that you seldom do
compaction.

Unified heaps for multiple processes are interesting, because nobody
knows whether the intermixing of objects created and used for different
concurrent processes tends to increase fragmentation, or reduce it,
or what.  (And I think that's likely to be very allocator dependent.)

The fact that this system could run for a day or more without accumulating
pathological fragmentation is a good sign, up to a point.

Is there a more detailed description of the allocator strategy?  (E.g.,
did the allocator attempt to separate objects created by different
processes, or just mix them together in allocation order, or what?)

Anecdotes about what was tried and *didn't* work well would be as
interesting as descriptions of what was eventually settled on.

>How this experience would scale to a modern environment is not obvious.

Too true, but it's food for thought.

>Robin Popplestone.

-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      
From: Kelvin Nilsen
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <kelvin.807660304@kickapoo>
···@cs.umass.edu ( Robin Popplestone ) writes:

>> In article <··········@info.epfl.ch>,
>> Stefan Monnier <··············@epfl.ch> wrote:
>>In article <·······················@192.0.2.1>,
>>Henry Baker <······@netcom.com> wrote:
>>] IMHO the major gain from compaction is to defragment the address space,
>>] but as Paul's recent work has shown, one may not have to do this very
>>] often.

>And Paul Wilson writes in <··········@jive.cs.utexas.edu>

>> Right. For "most" programs (that have been studied in enough detail, that
>> is), it looks like you never have to do it at all.  Nobody's studied
>> really long-running programs, though, and it's not clear whether
>> fragmentation will accumulate over time, or stabilize at an acceptable
>> level.

>The longest running,  biggest mix, garbage  collected systems are  probably
>those in which the operating system  itself is subject to the same  garbage
>collector as  user-programs.  Experience  with  the  Multipop  time-sharing
>system over  2 decades  ago showed  that compaction  was necessary  in  the
>circumstances pertaining then, at  least in a  small machine (384KB).  

While many individual programs tend to exhibit the behavior that Paul
describes, I would certainly not want to gamble the future of automatic
garbage collection on overgeneralization of this empirical observation.
There are a number of reasons why it would be wise to hedge your bets:

One reason that defragmentation is often not necessary is because
many of the programs that have been studied empirically allocate
a large proportion of their objects from a small number of 
predetermined size classes.  This is not necessarily representative
of the "typical" application of the future:

     a)  The tendency to allocate objects in predetermined sizes is
         in part a consequence of certain programming language
         implementation designs (e.g. Lisp uses dotted pairs to
         represent "everything") and partly the influence of static
         programming language tunnel vision.  One of the indirect benefits
         of garbage collection is that it frees programmers from these
         restrictive mindsets.  Cdr-coded Lisp lists are likely to 
         vary greatly in length.  Does anyone have statistics?  In
         Icon, efficiency of string operations was a major design
         objective.  Since the implementation of strings is so efficient,
         Icon programmers feel very comfortable creating strings ranging
         in length from several bytes to tens of kilobytes.  Some 
         real-world programs even create strings that exceed a hundred
         kilobytes.  Icon's implementations of lists and dynamic hash tables
         also benefit greatly from the ability (made possible through
         automatic compacting garbage collection) to size internal objects 
         in whatever sizes are most appropriate for run-time efficiency.

     b)  Consider the implementation of digital TV studio software.  The
         "programmer" is likely to need quick access to a large number of
         digital audio and video clips.  The lengths of these clips will
         be highly unpredictable.  I can envision the day when these 
         audio-visual clips would be as easily manipulated as Icon strings
         are manipulated today.

     c)  Consider the implementation of a PDA system.  Suppose the system
         has no disk.  The PDA is required to retain its user's appointment
         calendar (with variable amounts of information representing each
          day's scheduled and logged activities), contact lists (with variable
          amounts of information associated with each contact), scratch pads 
         (including spreadsheets, ASCII notes, untranslated pen scribbles,
          vector graphics, email memos sent and received, faxes sent
          and received), voice notes (annotations to the scratch-pad 
          information, personal dictations, voice mail forwarded from
          the central office), executable code segments (for each of the
          supported applications), etc.  Clearly the sizes of allocated
         segments are highly unpredictable.  Now, add another complicating
         factor: compress all objects that are larger than a particular
         threshold size and have not been accessed within the last N
         minutes.   The compression factor itself is unpredictable.

In summary, the notion that "defragmentation of dynamic memory is not
necessary" is a self-fulfilling prophecy.  In the absence of compaction
and/or in environments that provide seemingly unbounded amounts of
virtual memory, programmers are likely to develop software in ways that
do not depend on defragmentation, but there are real, non-trivial costs 
associated with this style of development.


-- 

Kelvin Nilsen/Dept. of Computer Science/Iowa State University/Ames, IA  50011 
 (515) 294-2259   ······@cs.iastate.edu  uunet!cs.iastate.edu!kelvin
From: Henry Baker
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <hbaker-0708952138580001@192.0.2.1>
In article <················@kickapoo>, ······@cs.iastate.edu (Kelvin
Nilsen) wrote:

> ···@cs.umass.edu ( Robin Popplestone ) writes:
> 
> >> In article <··········@info.epfl.ch>,
> >> Stefan Monnier <··············@epfl.ch> wrote:
> >>In article <·······················@192.0.2.1>,
> >>Henry Baker <······@netcom.com> wrote:
> >>] IMHO the major gain from compaction is to defragment the address space,
> >>] but as Paul's recent work has shown, one may not have to do this very
> >>] often.
> 
> >And Paul Wilson writes in <··········@jive.cs.utexas.edu>
> 
> >> Right. For "most" programs (that have been studied in enough detail, that
> >> is), it looks like you never have to do it at all.  Nobody's studied
> >> really long-running programs, though, and it's not clear whether
> >> fragmentation will accumulate over time, or stabilize at an acceptable
> >> level.

I agree with Kelvin about not giving up on compaction completely, and disagree
with Paul that you _never_ have to do it.

While a significant fraction of the best brains in the computer business over
the past 30 years have been aimed at optimizing architectures so that objects
don't have to move (and thereby inadvertently penalizing any programs that
actually _do_ move objects), they haven't completely succeeded at beating
Robson and/or the second law of thermodynamics (in the sense I used it
in my Sigplan Notices paper).  Therefore, up until the point at which
your address space is so incredibly trashed that you can't allocate anything
of any reasonable size any more, your program will actually run pretty
well, because the memory hierarchy stuff (VM, cache) are doing all of the
dynamic moving of data for you, without your conscious control, and indeed
without much ability on your part to affect it much even if you wanted to.

The problem, like that of many GC systems, is that 'reorganizing' address
space is a relatively massive and expensive undertaking, and will not
happen very fast.  Furthermore, it is the kind of process that memory
hierarchies hate because their locality assumptions fail completely in
the face of such reorganization.

Therefore, unless one can somehow organize this process so that it takes
a relatively small percentage of the system, and doesn't interfere too
much with the normal operation of the program, the cure of reorganization
may be worse than the disease of fragmentation.

However, neither the disease nor the cure need be fatal, but they do have
to be approached with a great deal of subtlety.

If you have enough memory, you can either incrementally copy the stuff
to another processor, and let it do the reorganization for you (easier
in this case than in the reachability GC case), or have the same processor
do a kind of background reorganization task.  It is important, IMHO, that
the background task _know_ it is a background task, and be able to inform
the memory hierarchy of its second-class status, so that its activities
will not hurt the more important activities of the application program.

Incremental copying GC's try to do these things in this way (within the
constraints of the HW & OS).  The only slight variation is that with the
way the current generation of HW architectures are optimized, it may not
be optimum to copy _all_ the time, but only once in a while.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Thomas Wang
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <DD0722.Hst@cup.hp.com>
Henry Baker (······@netcom.com) wrote:

> Therefore, up until the point at which
> your address space is so incredibly trashed that you can't allocate anything
> of any reasonable size any more, your program will actually run pretty
> well, because the memory hierarchy stuff (VM, cache) are doing all of the
> dynamic moving of data for you, without your conscious control, and indeed
> without much ability on your part to affect it much even if you wanted to.

Precisely.  Doing no relocation/compaction is likely to give the best
performance for this reason.  Pathological fragmentation is still a problem,
which needs to be addressed.  Interestingly enough, present day VM hardware
is already sufficient to minimize the fragmentation problem through NOT
MAPPING the free fragments.  So the true problem lies in OS software, which
provides no API to portably access this feature.  What needs to be done is
for researchers to present a standard API package to the various OS
standards bodies, to ensure wide adoption of these VM access functions.

In one sample implementation, I imagine a non-moving incremental garbage
collector with a software write barrier.  This collector uses a binary buddy
system memory allocator, which minimizes external fragmentation.  Internal
fragmentation is minimized through VM hardware of not mapping the free
fragments.  Sub-page fragmentation is minimized by grouping small objects
into pools.  This collector would have good interactive performance, and
should be cache friendly as well.

Regards.

> www/ftp directory:
> ftp://ftp.netcom.com/pub/hb/hbaker/home.html

  -Thomas Wang              (Software Engineer, Enterprise Objects Program)
   ····@cup.hp.com          http://hpeopose.cup.hp.com/~wang/wang.html
From: Hans Boehm
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <40dfmv$lcd@news.parc.xerox.com>
····@cup.hp.com (Thomas Wang) writes:

>Precisely.  Doing no relocation/compaction is likely to give the best
>performance for this reason.  Pathological fragmentation is still a problem,
>which needs to be addressed.  Interestingly enough, present day VM hardware
>is already sufficient to minimize the fragmentation problem through NOT
>MAPPING the free fragments.  So the true problem lies in OS software, which
>provides no API to portably access this feature.  What needs to be done is
>for researchers to present a standard API package to the various OS
>standards bodies, to ensure wide adoption of these VM access functions.

>In one sample implementation, I imagine a non-moving incremental garbage
>collector with a software write barrier.  This collector uses a binary buddy
>system memory allocator, which minimizes external fragmentation.  Internal
>fragmentation is minimized through VM hardware of not mapping the free
>fragments.  Sub-page fragmentation is minimized by grouping small objects
>into pools.  This collector would have good interactive performance, and
>should be cache friendly as well.

I'm all in favor of portable APIs to access the VM system.  But it's
important to get the details right.  I think you want a call that marks
certain pages "disposable", i.e. the operating system may replace them
by zero-filled pages AT ITS DISCRETION.  An mmap/munmap style interface
that gives the client explicit control over what's mapped doesn't seem
quite right.  In particular, remapping a previously unmapped page is
unavoidably expensive, even if there was no demand for the physical page.
Security issues require that the page be cleared or somehow overwritten
every time you do that.  (This is another reason I'm not convinced that
malloc implementations should use munmap to return space to the OS.)
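
Something close to the "disposable" call can be approximated with madvise()
where it exists; a sketch under that assumption (flag availability and exact
semantics vary from system to system):

#include <stddef.h>
#include <sys/mman.h>

/* Advise the kernel that a page-aligned free region may be reclaimed and
   later refilled with zero-filled pages.  Where MADV_FREE exists it is
   closest to "discard at the OS's discretion"; MADV_DONTNEED discards the
   contents right away. */
int release_free_pages(void *page_aligned_start, size_t bytes)
{
#ifdef MADV_FREE
    return madvise(page_aligned_start, bytes, MADV_FREE);
#else
    return madvise(page_aligned_start, bytes, MADV_DONTNEED);
#endif
}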

The right primitive would generally allow an OS not to page out free space.
It is thus useful in many contexts.  But I'm not sure that the desire to
use binary buddy allocation is particularly convincing.  Since this allocator
presumably only works on page-sized or larger chunks, its running time
is likely to be in the noise compared to object initialization time.  Thus
you might as well use something that makes more efficient use of the
address space (e.g. some clever best fit implementation).

The other issue that needs to be handled somehow is what happens when you
run out of swap space.  I would want either AIX-style "low swap space"
warning signals, or an explicit call, which may fail, to reclaim a
"disposable" page.  A lot of real software at least tries to handle running
out of memory gracefully.  It succeeds enough of the time that you have to
continue to give it a chance.

I'd also love to see a cheap way to retrieve "page dirty" information
from the OS.  A software write barrier isn't always practical (e.g.
when dealing with third-party libraries.) And for some algorithms
dirty bits suffice as a VM write-barrier.  And they're
typically cheap to get from the hardware.
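
When the OS won't hand out dirty bits, the usual user-level approximation is
to write-protect the heap and catch the faults.  A rough POSIX-flavored
sketch follows; all names are invented, the bitmap is sized arbitrarily, and
calling mprotect from a signal handler is common collector practice but not
formally blessed by POSIX.

    // Page-granularity write barrier built from mprotect(): protected pages
    // fault on the first write, the handler records the page as dirty and
    // unprotects it so the write can proceed.
    #include <signal.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <string.h>

    static uintptr_t     heap_start;
    static size_t        page_size;
    static unsigned char dirty[1 << 17];     // one bit per heap page (sketch-sized)

    static void barrier_fault(int, siginfo_t* si, void*)
    {
        uintptr_t addr = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);
        size_t    page = (addr - heap_start) / page_size;
        dirty[page >> 3] |= (unsigned char)(1 << (page & 7));
        mprotect((void*)addr, page_size, PROT_READ | PROT_WRITE);
    }

    // Call at the start of each collection cycle: clear the dirty set and
    // write-protect the heap so subsequent writes get recorded.
    static void reset_write_barrier(void* start, size_t len)
    {
        heap_start = (uintptr_t)start;
        page_size  = (size_t)sysconf(_SC_PAGESIZE);
        memset(dirty, 0, sizeof dirty);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = barrier_fault;
        sa.sa_flags     = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, 0);

        mprotect(start, len, PROT_READ);     // writes now trap until unprotected
    }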

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From: Henry Baker
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <hbaker-1008951349590001@192.0.2.1>
In article <··········@news.parc.xerox.com>, ·····@parc.xerox.com (Hans
Boehm) wrote:

> I'm all in favor of portable APIs to access the VM system.  But it's
> important to get the details right.  I think you want a call that marks
> certain pages "disposable", i.e. the operating system may replace them
> by zero-filled pages AT ITS DISCRETION.  An mmap/munmap style interface
> that gives the client explicit control over what's mapped doesn't seem
> quite right.  In particular, remapping a previously unmapped page is
> unavoidably expensive, even if there was no demand for the physical page.
> Security issues require that the page be cleared or somehow overwritten
> every time you do that.  (This is another reason I'm not convinced that
> malloc implementations should use munmap to return space to the OS.)

As many VM's and/or disks are already starting to compress pages when
swapping them out, the cost of zero'ing is no larger than the cost of
compressing, and compressing a page of zeros is moderately easy.  So
your concerns about the cost of zeroing out pages are probably misplaced.

If forced to live with an interface, I'd be happy with the following
simple interface:

The OS notices (either by compression or other means) that a page is
all 0's.  In this case, it doesn't bother writing the page to disk,
and frees the disk page as well.  (If you're still concerned about
security, then you're going to have to rewrite the page, or start
leaning on your disk vendors to provide destructive reads a la
linear logic.)  Later, if the page is referred to again, the VM
can reconstitute it, again without any disk I/O.  The same trick
will also work for caches.
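
The all-zeros test itself is trivial; purely as an illustration (not from the
original post), here is the check a pager or compressor might run before
bothering to compress or write a page out:

    // Sketch: return nonzero if the page contains only zero bytes.  A pager
    // that sees this can drop the page and mark it zero-fill-on-demand in
    // the page table instead of writing it to disk.
    #include <stddef.h>

    static int page_is_zero(const void* page, size_t page_size)
    {
        const unsigned long* p = (const unsigned long*)page;
        size_t n = page_size / sizeof(unsigned long);
        for (size_t i = 0; i < n; i++)
            if (p[i] != 0)
                return 0;
        return 1;
    }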

The OS should also provide a call to explicitly zero a region of memory.
The first and last pages of the region can be zeroed explicitly, with
the remainder of the pages being zeroed implicitly by declaring them to
be 0's in the page table.  This is an optimization, since the user can
achieve the same effect more clumsily by zeroing the pages himself.

Finally, the OS could provide some method for 'write-allocate' in VM.  One
scheme might be to keep track of a compact region of a page that has been
written to.  Any reads will force a read from the disk, and a merging of
the new & old information.  Another method might be to do some kind of
checkpoint of the program at the point that it starts writing to the
page, and if it ever reads before the page is fully initialized, then
the program is restarted with the 'real' page.  Yet another scheme would
be to explicitly store the write-allocate bits from the cache in real
memory when the cache flushes.  If the page has write permission, but
not read permission, then a trap will occur if the program attempts
to access the page.

> I'd also love to see a cheap way to retrieve "page dirty" information
> from the OS.  A software write barrier isn't always practical (e.g.
> when dealing with third-party libraries.) And for some algorithms
> dirty bits suffice as a VM write-barrier.  And they're
> typically cheap to get from the hardware.

I've been trying since 1974 to get "page dirty" information from the VM with
relatively little success.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Paul Wilson
Subject: fragmentation, compression, etc. (was Re: allocator and GC locality).
Date: 
Message-ID: <40jbvl$qgv@boogie.cs.utexas.edu>
In article <················@kickapoo>,
Kelvin Nilsen <······@cs.iastate.edu> wrote:
>···@cs.umass.edu ( Robin Popplestone ) writes:
>
>>> In article <··········@info.epfl.ch>,
>>> Stefan Monnier <··············@epfl.ch> wrote:
>>>In article <·······················@192.0.2.1>,
>>>Henry Baker <······@netcom.com> wrote:
>>>] IMHO the major gain from compaction is to defragment the address space,
>>>] but as Paul's recent work has shown, one may not have to do this very
>>>] often.
>
>>And Paul Wilson writes in <··········@jive.cs.utexas.edu>
>
>>> Right. For "most" programs (that have been studied in enough detail, that
>>> is), it looks like you never have to do it at all.  Nobody's studied
>>> really long-running programs, though, and it's not clear whether
>>> fragmentation will accumulate over time, or stabilize at an acceptable
>>> level.
>
>While many individual programs tend to exhibit the behavior that Paul
>describes, I would certainly not want to gamble the future of automatic
>garbage collection on overgeneralization of this empirical observation.

I wouldn't either.  I'm not arguing that compaction is not a good thing;
in the best of all possible worlds most people would use languages and
compilers and hardware that are friendly to copying garbage collection.

I *will* argue, though, that the importance of compaction appears to
be widely overestimated, and that there's some data that suggests that
most programs are ``well behaved'' with respect to fragmentation, if
you pick a good allocator.  This appears to be true for most of the data
I've seen (and gathered), but unfortunately there doesn't seem to be
nearly enough data to know how general it is.

I also think that some of the exceptional cases (where fragmentation is
problematic) can be dealt with reasonably well by hacking implementations
of data structures that respect granularity issues.

>There are a number of reasons why it would be wise to hedge your bets:
>
>One reason that defragmentation is often not necessary is because
>many of the programs that have been studied empirically allocate
>a large proportion of their objects from a small number of 
>predetermined size classes.  This is not necessarily representative
>of the "typical" application of the future:
>
>     a)  The tendency to allocate objects in predetermined sizes is
>         in part a consequence of certain programming language
>         implementation designs (e.g. Lisp uses dotted pairs to
>         represent "everything") and partly the influence of static
>         programming language tunnel vision.  One of the indirect benefits
>         of garbage collection is that it frees programmers from these
>         restrictive mindsets.  Cdr-coded Lisp lists are likely to 
>         vary greatly in length.  Does anyone have statistics?

My impression is that CDR-coding is dead, and I'll bet it stays dead.
It's just too slow on stock hardware, and it's not clear it was ever
really worth it even on hardware designed for it.

Compressed paging is more general, and easier to support with reasonable
efficiency.  It also has the potential to hide a multitude of data
representation sins, which is good---you can use fast "horizontal"
representations, and compress them down to something more nearly optimal
in the VM (or even cache) system(s).  (See the paper "Operating System
Support for Small Objects" in our ftp repository mentioned in my .sig.)

>         In Icon, efficiency of string operations was a major design
>         objective.  Since the implementation of strings is so efficient,
>         Icon programmers feel very comfortable creating strings ranging
>         in length from several bytes to tens of kilobytes.  Some 
>         real-world programs even create strings that exceed a hundred
>         kilobytes.  Icon's implementations of lists and dynamic hash tables
>         also benefit greatly from the ability (made possible through
>         automatic compacting garbage collection) to size internal objects 
>         in whatever sizes are most appropriate for run-time efficiency.

I think almost all real scalable data structures allow for some variation
in the granule size, so that library implementors can pick sizes that
don't tend to cause a lot of obnoxious fragmentation.  This is not as nice
as having fragmentation automatically "go away" due to compaction, but it'll
do in a pinch, and I think non-compacting memory management will generally
have a slight efficiency edge---or a large one, for large objects.

>     b)  Consider the implementation of digital TV studio software.  The
>         "programmer" is likely to need quick access to a large number of
>         digital audio and video clips.  The lengths of these clips will
>         be highly unpredictable.  I can envision the day when these 
>         audio-visual clips would be as easily manipulated as Icon strings
>         are manipulated today.

I'm not sure I follow this.  I suspect that scalable data structures are
a good idea in both cases, and that in both cases the fragmentation problem
can be minimized by the appropriate granularity choices in implementing
the data structures.  The fact that strings and compressed audio or video
don't generally require contiguous storage means that fixed chunk sizes
can be used without much cost in either internal or external fragmentation.

>     c)  Consider the implementation of a PDA system.  Suppose the system
>         has no disk.  The PDA is required to retain its user's appointment
>         calendar (with variable amounts of information representing each
>          day's scheduled and logged activities), contact lists (with variable
>          amounts of information associated with each contact), scratch pads 
>         (including spreadsheets, ASCII notes, untranslated pen scribbles,
>          vector graphics, email memos sent and received, faxes sent
>          and received), voice notes (annotations to the scratch-pad 
>          information, personal dictations, voice mail forwarded from
>          the central office), executable code segments (for each of the
>          supported applications), etc.  Clearly the sizes of allocated
>         segments are highly unpredictable.  Now, add another complicating
>         factor: compress all objects that are larger than a particular
>         threshold size and have not been accessed within the last N
>         minutes.   The compression factor itself is unpredictable.

Right, but compressed data don't have to be stored contiguously.  They
can be divided into fixed-sized chunks that are trivial to manage without
significant fragmentation, and strung together during uncompression to
"materialize" a contiguous object (or page) on demand.  (This is not
hypothetical, by the way---it's from experience with real compressed paging.)
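
As a made-up illustration of that layout (names and sizes are mine): the
compressed bytes live in fixed-size chunks on a simple list, so freeing and
reusing them causes no fragmentation, and a contiguous copy is only
materialized when something actually needs one.

    // Sketch of storing a compressed object as a chain of fixed-size chunks.
    #include <string.h>
    #include <stddef.h>

    enum { CHUNK_SIZE = 4096 };

    struct Chunk {
        Chunk*        next;
        size_t        used;                 // bytes of data[] actually in use
        unsigned char data[CHUNK_SIZE];
    };

    // Copy the chain back into one contiguous buffer ("materialize" it).
    static size_t materialize(const Chunk* c, unsigned char* out, size_t out_len)
    {
        size_t total = 0;
        for (; c != 0 && total + c->used <= out_len; c = c->next) {
            memcpy(out + total, c->data, c->used);
            total += c->used;
        }
        return total;                        // bytes written
    }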

>In summary, the notion that "defragmentation of dynamic memory is not
>necessary" is a self-fulfilling prophecy.  In the absence of compaction
>and/or in environments that provide seemingly unbounded amounts of
>virtual memory, programmers are likely to develop software in ways that
>do not depend on defragmentation, but there are real, non-trivial costs 
>associated with this style of development.

True, but I don't think the costs are huge.  And in the long run, I'm on
your side---when we figure out what the *real* issues are in storage
management, we should evolve towards more flexible systems that can support
more nearly ideal memory management.

On the other hand, it's not clear what fragmentation *is* costing.  Given
the pathetic state of allocator research at this point, I wouldn't be
surprised if a coupla billion dollars worth of RAM is going to waste
right now due to dumb allocators.  (Given the widespread use of ad hoc
and brain-dead allocators in C++, I'd be surprised if a coupla billion
dollars worth of RAM *isn't* going to waste right now.)   My agenda at
the moment is to fix that problem.  I suspect that if billions of dollars 
are being wasted at the moment, there are some easy fixes that will fix
*most* of the problem.  (See our allocator survey, also in our ftp 
repository.)  Then we get into the fancy stuff.

-- 
| Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)
| Papers on memory allocators, garbage collection, memory hierarchies,
| persistence and  Scheme interpreters and compilers available via ftp from 
| ftp.cs.utexas.edu, in pub/garbage or http://www.cs.utexas.edu/users/oops/)      
From: Hans Boehm
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <4044qp$b63@news.parc.xerox.com>
···@cs.umass.edu ( Robin Popplestone ) writes:

>The longest running,  biggest mix, garbage  collected systems are  probably
>those in which the operating system  itself is subject to the same  garbage
>collector as  user-programs.  Experience  with  the  Multipop  time-sharing
>system over  2 decades  ago showed  that compaction  was necessary  in  the
>circumstances pertaining then, at  least in a  small machine (384KB).  Both
>data and code-blocks were garbage collected, and the job-mix included
>robotic programs that had largish digitised images as well as classic
>list processing such as the (original) Boyer-Moore theorem prover.
>The garbage collector operated in-situ (no virtual memory!), but had a
>compaction pass which was conditionally invoked. Without compaction, the
>system would rapidly become unusable and have to be rebooted. With
>compaction it would run for a (small) number of days.

I think this all depends on so many factors (e.g. allocation algorithms,
presence of VM, scarcity of physical memory) that it is very hard to
generalize.  My Cedar environment stays up for at least weeks 
with a purely noncompacting collector.  Though it runs as a
group of SunOS processes, it also has thread switcher and many client
threads (mailers, editors, etc.) running in one address space.  I don't
notice any performance deterioration.  But the heap rarely has more than
25 MB or so of live data, and recently I've always had at least 64MB of
physical memory.

>How this experience would scale to a modern environment is not obvious.

Agreed.

I suspect that the right kind of compaction is occasionally useful.  But
you can go a long way without it.

Hans-J. Boehm
(·····@parc.xerox.com)
Standard disclaimer ...
From:  Robin Popplestone
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <405tp7$efk@kernighan.cs.umass.edu>
Hans-J. Boehm (·····@parc.xerox.com) writes in
<··········@news.parc.xerox.com>

> I think this  all depends on  so many factors  (e.g allocation  algorithms,
> presence of  VM, scarcity  of physical  memory)  that it  is very  hard  to
> generalize. My Cedar environment stays up for at least weeks with a  purely
> noncompacting collector. Though it runs as  a group of SunOS processes,  it
> also has thread switcher and  many client threads (mailers, editors,  etc.)
> running in one address space. I don't notice any performance deterioration.

One crude measure of the effect of fragmentation is:

size of store
-------------
size of biggest stored object

This was about 500 for Multipop, assuming we stored visual data for the robot
by separate rows. For Cedar, I imagine it could be over a hundred thousand,
though you don't tell us how, for example, the editors store their text.
However, those who speak of megabyte single objects are certainly speaking of
systems that could get seriously fragmented.

> But the heap rarely has more than 25 MB or so live data, and recently  I've
> always had at least 64MB of physical memory.

Multipop often ran close to having no free memory whatever, at which point
users who exceeded their memory quota would be thrown off. Thus the compactor
had to realise all the available store so that we could use almost all of
it in some applications.

> I suspect that the right kind of compaction is occasionally useful.  But
> you can go a long way without it.

In a modern environment, compaction is more a matter of efficiency, determined
by interactions between the garbage collector and the various memory systems.
However systems that can do compaction (or in general relocation of objects)
do possess degrees of freedom that systems that cannot relocate objects
do not possess. This freedom offers potential for enhancing performance.


Robin Popplestone.
From:  Robin Popplestone
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <4059v8$8q4@kernighan.cs.umass.edu>

Paul Wilson writes:

> The fact that this system could run for a day or more without
> accumulating pathological fragmentation is a good sign, up to a point.

I think you have misunderstood me - my earlier message stated:

> Without compaction, the system would rapidly become unusable and have to
> be rebooted. With compaction it would run for a (small) number of days.

That is, *terminal* fragmentation set in within about an hour of usage if
the compaction algorithm was not in use.  With the compaction algorithm, it
would in principle run indefinitely (given that users who exceeded their
allocated resources could be killed). However there was a bug somewhere in
the system (we suspected the hardware) which resulted in a mean time
between failure of something over 1 day. Even programs like the Boyer-Moore
prover, whose data-usage was mostly list-cells, would cause fragmentation as
they were incrementally re-compiled during debugging because the code-blocks
were of disparate sizes.

It is possible that Multipop will be revived in the Royal Scottish
Museum, which has acquired the necessary hardware to do so. The (paper)
tapes of at least one version of the system also exist, so
the performance can possibly be verified.

Paul Wilson also writes:

> I find this pretty interesting.  I take it this was a unified heap,
> like in a Lisp machine system, where you don't have the usual process
> memory boundaries, but unlike a Lisp machine in that you seldom do
> compaction very often.
...

From (my wetware) memory (this could in principle be verified from
extant listings):

The memory belonging to each process was NOT segregated. Security
depended solely on the fact that the one compiler (for POP-2) generated
well behaved code of a quality roughly comparable to a simple LISP
compiler. Store usage by a user was accounted for by the garbage collector,
which, starting from process records, added the size of each marked object
to a variable. A user exceeding his/her allocation could be killed by the
machine operator (!) setting a sense-key on the console; this was
generally done if the machine seemed to be g.c. thrashing.

> Unified heaps for multiple processes are interesting, because nobody
> knows whether the intermixing of objects created and used for different
> concurrent processes tends to increase fragmentation, or reduce it,
> or what.  (And I think that's likely to be very allocator dependent.)

The earliest design for Multipop68 called for a "zoo of objects in
cages", i.e. each class of object would be held in a separate area of
store, so that type could be determined by location. Thus user-specific
data-types would have been separated, though common types (e.g. lists,
strings) would have been put in the same cage. This "zoo" was abandoned in
favour of the "key cell" system (q.v.), due to Mike Foster, partly because
of the greater complication of the zoo, and partly because we suspected
that any savings arising from the encoding of type in the address would be
lost in the existence of unusable space in partly-full cages.

A more conventional time sharing system (Multipop 70) was
developed for the target machine. It was universally disliked
for bad performance. However, that arose from the limitations of the discs,
combined with a small main memory, so it has no relevance to the
fragmentation issue.

> Is there a more detailed description of the allocator strategy?  (E.g.,
> did the allocator attempt to separate objects created by different
> processes, or just mix them together in allocation order, or what?)

The allocator maintained separate free lists for small objects (up to about
five words in length I recall). Longer objects were allocated by searching
a free list until one of the right length or larger was found. These free
lists were global to the system. If no object was found, a garbage
collection was started. (This meant of course that a user who turned over
store rapidly got an unfair share of machine resources).
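
Roughly, and purely as a reconstruction (this is not Multipop code; the
sizes and names below are invented), that scheme looks like this: exact-size
free lists for the small common sizes, and a single first-fit list for
everything larger.

    // Sketch of segregated free lists for small objects plus first fit for
    // the rest; sizes are in words.
    #include <stddef.h>

    enum { SMALL_MAX = 5 };                  // "up to about five words"

    struct Free { struct Free* next; size_t words; };

    static struct Free* small_list[SMALL_MAX + 1];  // one list per small size
    static struct Free* big_list;                   // everything larger

    static void* allocate(size_t words)
    {
        if (words <= SMALL_MAX && small_list[words]) {   // exact fit, O(1)
            struct Free* b = small_list[words];
            small_list[words] = b->next;
            return b;
        }
        for (struct Free** p = &big_list; *p; p = &(*p)->next) {
            if ((*p)->words >= words) {                  // first fit
                struct Free* b = *p;
                *p = b->next;
                return b;            // (a real allocator would split the excess)
            }
        }
        return 0;                    // caller starts a garbage collection
    }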

I recall that we used a dedicated area of store to hold the
old-address->new address map for the purposes of compaction - thus more
than one compaction might be needed to recover severely fragmented space.

Each object in memory had one pointer to a "key cell" which contained
data-class-common information, including two methods needed by the garbage
collector, a mark-and-trace method and a relocate method (we did not call
them methods of course...). A data-class whose members varied in length
(e.g. strings) had an additional field in each instance which specified the
length. Users could define their own data-classes.

A process record contained (a) 2 stacks for each user (b) a dictionary
mapping identifiers to store locations (all variables were in the LISP
sense special, with closures done by what was later to be called "lambda
lifting") (c) a copy of the special variables of the compiler. A context
switch involved saving the special variables (including the PC) of the user
being swapped out, and restoring those of the user being swapped in.

Thus each user had one process. Later (around 1970) continuations were
added to POP-2, but were never used as the basic process mechanism of
Multipop, being decidedly inefficient given the non-lexical nature of
variables.

The allocation was complicated by boundaries in the machine architecture -
references (variable-value slots) had to be in the 1st 32k words, and
program within the 1st 64k. In effect, this was a partial revival of the
"zoo" architecture.

Interrupt handling in Multipop was conventionally coded in assembly
language. The interface to the operating system used function-closures to
provide security. For example, a disc-file was presented as a closure whose
data-part was a buffer, and whose function part was the code to extract
characters from the buffer.

Robin Popplestone.
From: Henry Baker
Subject: Re: allocator and GC locality.
Date: 
Message-ID: <hbaker-0708952118360001@192.0.2.1>
In article <··········@kernighan.cs.umass.edu>, ···@cs.umass.edu ( Robin
Popplestone ) wrote:

[snip of lots of really interesting tidbits]

I _sincerely_ hope that this stuff is written down somewhere.  It's such
a waste of brainpower to recreate these things every generation.  This
mode makes each generation stand on its predecessor's toes instead of
its predecessor's shoulders.

Also, like the teenager who discovers that his father learned one heck
of a lot between the time the teenager was 15 and the time he was 21, many of us
would be interested in some of the more subtle details of what our
predecessors did, because there may be additional wisdom there that
wasn't evident the first or second time we thought about the problem.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Thomas Wang
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <DCLIwB.D8w@cup.hp.com>
Paul Wilson (······@cs.utexas.edu) wrote:

> The memory lost to fragmentation in conventional allocators appears to be
> widely overestimated---at least for the best allocators.  This is partly
> because the allocator research area is such a mess that people use the
> wrong allocators (e.g., next fit) or don't tweak them in the right ways
> (e.g., getting rid of footer overhead for boundary tags).

> Our recent results show good allocators having only about 10% fragmentation.
> This is comparable to the losses you get just from a header word and rounding
> object sizes up to a word or double word for architectural reasons.

> (Some caveats are in order---this is for a set of eight varied C and C++
> programs, but there are lots of kinds of programs that aren't represented
> in *any* small set, and none of our programs runs for a very long time.)

As a contrasting data point, our commercial DBMS server initially had some
run-away fragmentation.  We worked around it by keeping an internal free
list that is never returned to the system.  Frequent use of megabyte-sized
buffers contributed to the memory fragmentation problem.  You know, it is
going to be these large multi-million dollar projects that run into this
kind of problem.  On second thought, multi-media applications can run into
fragmentation problems as well.

> It's not clear to me that compaction is a win, on average.  If objects
> are placed correctly in the first place, it may not be necessary and it
> may be *bad* for locality on average.  The allocation order and sizes
> of objects are likely to provide very strong hints as to which objects
> will be used together.  This info should be usable for decreasing
> fragmentation and for increasing locality.

I believe the most realistic control on fragmentation can be achieved by
giving language implementors access to the virtual memory functions that
free memory pages.  Making large free fragments not take up memory space is
such an obvious idea.  This solution should knock out the dirty cache
problem, as well as being applicable to conservative collectors.

A master's-thesis-quality paper could be written by implementing this scheme
on AIX and then doing some performance analysis.

Regards.


> Paul R. Wilson, Comp. Sci. Dept., U of Texas @ Austin (······@cs.utexas.edu)


  -Thomas Wang              (Software Engineer, Enterprise Objects Program)
   ····@cup.hp.com          http://hpeopose.cup.hp.com/~wang/wang.html
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-2807951148460001@192.0.2.1>
In article <··········@info.epfl.ch>, "Stefan Monnier"
<··············@epfl.ch> wrote:

> Too true: current systems are designed mainly to run stack-based programs, so
> it's always safer to try to stick as much as possible to the same paradigm.
> Appel's argument about GC and stack being just as fast (while the GC
> approach is
> simpler since there *is* a GC anyway) is not convincing enough: the added
> complexity buys you predictability (and it's a "one-time" complexity cost).
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Oh, but it's _not_ a "one-time" complexity cost!  Just look at the thousands
of papers over the past 25 years about how to force everything into the
Algol/PLI/Pascal/C/C++ "everything has to be stack-based" model.

Somehow, we keep paying, and paying, and paying for this 'one-time' cost.

---------

[For those of you who will be confused by this posting because of my
previous postings advocating stacks, you should know that the stacks
I was advocating are quite a bit different from the Algol/PLI/Pascal/C/C++
model.  If you _have_ to program in one of these dinosaurs, you should
definitely consider using the "Cheney on the MTA"/pushy stack buffer
approach:  ftp://ftp.netcom.com/pub/hb/hbaker/CheneyMTA.html  (also .ps.Z).]

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Paul Flinders
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <PTF.95Aug7083817@fat-controller.cs.bham.ac.uk>
In article <··········@fido.asd.sgi.com> ·······@engr.sgi.com (Mike McDonald) writes:

  >   A page fault or cache miss between accessing the length and the
  > contents cost one heck of a lot! I wouldn't be surprised that with
  > a lot of allocators, the miss happens often enough to cause the 3x
  > difference in speed overall.

Ok, that's possible - but if you're doing much string manipulation
then you probably have more than one string, in which case you can arrange
the storage so that the class objects which hold the lengths are close
together in memory; then they're likely to stay in core.
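
A crude sketch of that arrangement (all names are mine, not from the thread):
carve the length-holding headers out of one contiguous pool, so that
scanning many lengths touches a few cache lines rather than one line per
string.

    // Sketch: string headers allocated from a contiguous pool; the character
    // data can live anywhere, but the lengths stay packed together.
    #include <stddef.h>

    struct StrHeader {
        size_t length;
        char*  data;
    };

    enum { POOL_MAX = 4096 };
    static StrHeader header_pool[POOL_MAX];  // headers are adjacent in memory
    static size_t    next_free;

    static StrHeader* new_header(size_t length, char* data)
    {
        if (next_free == POOL_MAX)
            return 0;                // pool exhausted (no reuse in this sketch)
        StrHeader* h = &header_pool[next_free++];
        h->length = length;
        h->data   = data;
        return h;
    }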

TTFN

Paul.
From: Cliff Click
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <CLIFFC.95Aug7173142@crocus.hpl.hp.com>
·····@parc.xerox.com (Hans Boehm) writes:

> So we're ignoring a potentially exponential complexity difference, and
> we're concentrating on the factor of 2 in the asymptotically inferior
> solution?  Why?

4. We didn't know ["ropes", "cords"] existed.
3. Strings aren't a bottleneck, so we don't care.
2. We are worried that the constant factor of the asymptotically better
   solution will be too large, 'cause we'll never hit the bad case in
   our applications.

and the number 1 reason is:

1. We learned C strings first, so that's what we'll always use.

     ( Triple smileys all around, except for the #1 reason :-P )


Cliff

--
Cliff Click                  Compiler Research Scientist
Cambridge Research Office,   Hewlett-Packard Laboratories
One Main Street, 10th Floor, Cambridge, MA 02142
(617) 225-4915  Work         (617) 225-4930  Fax
······@hpl.hp.com            http://bellona.cs.rice.edu/MSCP/cliff.html
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug9150054@slsvhdt.lts.sel.alcatel.de>
In article <··········@fido.asd.sgi.com> ·······@engr.sgi.com (Mike
McDonald) writes:

|> In article <······@bmtech.demon.co.uk>, Scott Wheeler <······@bmtech.demon.co.uk> writes:


|> |> class str {
|> |>     int iCount;
|> |>     char achData[256];
|> |> };
|> |> 
|> |> with the obvious problems in terms of fixed maximum length of the 
|> |> string.

|>   Yuch! In C, (I don't know how to do it in C++) you'd do something like:

|> typedef struct string {
|>   int length;
|>   char contents[0];
|>   } *String;

Not legal, at least not in C (or C++).  Arrays may not have a length
of 0.

|> String make_string(int n)
|> {
|>   String s;

|>   s = (String) calloc(1, sizeof(struct string) + n);
|>   if (s == NULL)
|>     you_lose();
|>   s->length = n;
|>   return (s);
|> }


|> (If your C compiler doesn't like arrays of length zero, declare contents as
|> length 1 and subtract 1 from the sizeof(struct string).)

But you still cannot *access* the remaining memory.  At least not
easily.  Accessing contents with an index > 0 is undefined behavior;
with any implementation of reasonable quality, it will cause a bounds
check error.  (Of course, I've never seen an implementation of C with
reasonable quality, so you may be safe.  Although I seem to have heard
somewhere that there is one, Centerline, maybe?)

Interestingly enough, there is a legal (and safe) way of implementing
this same idiom in C++.  It's so ugly, though, that I'm not going to
post it.  (I know, if you were worried about ugliness, you wouldn't be
using C++ anyway:-).)

|>   This way, both the length and the data are allocated together in the
|> heap.  And you haven't limited the max length artificially.  (I suspect
|> most BASIC systems do this.)

I once ran benchmarks (with a very simple program, so probably not
indicative of anything) of three reference counted implementations of
C++ strings: the classical one, the classical one with operator new
overloaded for the fixed length header class, and one using the above
ugly hack (and so only one allocation per string).  Under Solaris,
using the standard system malloc (called directly by new), there was
no significant difference in runtime.

-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-0908950821150001@192.0.2.1>
In article <··················@slsvhdt.lts.sel.alcatel.de>,
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

> Interestingly enough, there is a legal (and safe) way of implementing
> this same idiom in C++.  It's so ugly, though, that I'm not going to
> post it.  (I know, if you were worried about ugliness, you wouldn't be
> using C++ anyway:-).)

Go ahead & post it!  Hit us with your ugly stick!  We're adults, and
we've turned off the V-chip...

> |>   This way, both the length and the data are allocated together in the
> |> heap.  And you haven't limited the max length artificially.  (I suspect
> |> most BASIC systems do this.)
> 
> I once ran benchmarks (with a very simple program, so probably not
> indicative of anythinkg) of three reference counted implementations of
> C++ strings: the classical one, the classical one with operator new
> overloaded for the fixed length header class, and one using the above
> ugly hack (and so only one allocation per string).  Under Solaris,
> using the standard system malloc (called directly by new), there was
> no significant difference in runtime.

We may be interested in these results after we've seen the program.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: Sam Kendall
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <DD678p.44r@mv.mv.com>
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

>Accessing contents with an index > 0 is undefined behavior;
>with any implementation of reasonable quality, it will cause a bounds
>check error.  (Of course, I've never seen an implementation of C with
>reasonable quality, so you may be safe.  Although I seem to have heard
>somewhere that there is one, Centerline, maybe?)

That's right; CenterLine has a C/C++ interpreter that checks all sorts of
things, including array and pointer bounds.  There are other products,
notably Purify and similar things, that catch some but not all bounds
violations in these awful, unsafe languages.

My fantasy: what if 10% of the effort spent in developing these C/C++
error-checking tools (I've written major chunks of several of them) were spent
switching programmers over to a good, safe language like Modula-3 or Dylan?
Ah, but on these newsgroups I'm mostly preaching to the converted.

--Sam Kendall
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug10172327@slsvhdt.lts.sel.alcatel.de>
In article <·······················@192.0.2.1> ······@netcom.com
(Henry Baker) writes:

|> In article <··················@slsvhdt.lts.sel.alcatel.de>,
|> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

|> > Interestingly enough, there is a legal (and safe) way of implementing
|> > this same idiom in C++.  It's so ugly, though, that I'm not going to
|> > post it.  (I know, if you were worried about ugliness, you wouldn't be
|> > using C++ anyway:-).)

|> Go ahead & post it!  Hit us with your ugly stick!  We're adults, and
|> we've turned off the V-chip...

Just the essentials (and ignoring eventual reference counting):

    class StringImpl
    {
    public :
        static StringImpl*  mkStringImpl( int length ) ;
        char*               buffer() ;

        void*               operator new( size_t s , int length ) ;
        void                operator delete( void* ) ;
    private :
                            StringImpl( int length ) ;
        int                 theLength ;
    } ;

    StringImpl*
    StringImpl::mkStringImpl( int length )
    {
        return new( length ) StringImpl( length ) ;
    }

    char*
    StringImpl::buffer()
    {
        return (char*)( this + 1 ) ;
    }

    void*
    StringImpl::operator new( size_t s , int length )
    {
        assert( s == sizeof( StringImpl ) ) ;
        return ::operator new( s + length ) ;
    }
    void
    StringImpl::operator delete( void* p )
    {
        ::operator delete( p ) ;
    }

    StringImpl::StringImpl( int length )
        :   theLength( length )
    {
    }

The purpose of the private constructor (and the static mkStringImpl)
is to ensure that 1) all elements *are* created on the heap, and 2)
the extra parameter to new and the parameter to the constructor are
identical.  (Safety, in sum.)

In my tests, all of the functions were made inline.
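
For readers following along, here is a hypothetical usage sketch of the
class above (mine, not from the original posting); a real string class would
wrap this behind an interface rather than exposing it to clients.

    // Hypothetical use of StringImpl.
    #include <string.h>

    void example()
    {
        StringImpl* s = StringImpl::mkStringImpl( 6 ) ;
        memcpy( s->buffer() , "hello" , 6 ) ;   // buffer() is the storage just past the header
        // ... use s->buffer() ...
        delete s ;                              // dispatches to StringImpl::operator delete
    }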

|> > |>   This way, both the length and the data are allocated together in
|> > |> the heap.  And you haven't limited the max length artificially.  (I
|> > |> suspect most BASIC systems do this.)
|> > 
|> > I once ran benchmarks (with a very simple program, so probably not
|> > indicative of anythinkg) of three reference counted implementations of
|> > C++ strings: the classical one, the classical one with operator new
|> > overloaded for the fixed length header class, and one using the above
|> > ugly hack (and so only one allocation per string).  Under Solaris,
|> > using the standard system malloc (called directly by new), there was
|> > no significant difference in runtime.

|> We may be interested in these results after we've seen the program.

I don't have the benchmark anymore, and forget exactly what I tested
(in addition to just making the strings).  I do remember being
surprised by the results, since the malloc under SunOS 4 is not
particularly efficient when a lot of small blocks are being allocated.

-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung
From: Henry Baker
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <hbaker-1008951322190001@192.0.2.1>
In article <···················@slsvhdt.lts.sel.alcatel.de>,
·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

> In article <·······················@192.0.2.1> ······@netcom.com
> (Henry Baker) writes:
> 
> |> In article <··················@slsvhdt.lts.sel.alcatel.de>,
> |> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:
> 
> |> > Interestingly enough, there is a legal (and safe) way of implementing
> |> > this same idiom in C++.  It's so ugly, though, that I'm not going to
> |> > post it.  (I know, if you were worried about ugliness, you wouldn't be
> |> > using C++ anyway:-).)
> 
> |> Go ahead & post it!  Hit us with your ugly stick!  We're adults, and
> |> we've turned off the V-chip...
> 
> Just the essentials (and ignoring eventual reference counting):
> 
>     class StringImpl
>     {
>     public :
>         static StringImpl*  mkStringImpl( int length ) ;
>         char*               buffer() ;

// Gain access to the actual string buffer.

>         void*               operator new( size_t s , int length ) ;

// Overlard misplacement 'new'.

>         void                operator delete( void* ) ;
>     private :
>                             StringImpl( int length ) ;

// private ctor means no static and/or stack instances; being 'private' means
// that derived classes can't create any, either.  No default ctor means no
// arrays or members.  No friends declared.  _Do_ have default assignment,
// so it isn't as safe as you might think.

>         int                 theLength ;
>     } ;
> 
>     StringImpl*
>     StringImpl::mkStringImpl( int length )
>     {
>         return new( length ) StringImpl( length ) ;
>     }
> 
>     char*
>     StringImpl::buffer()
>     {
>         return (char*)( this + 1 ) ;

// This may not work so well for arrays of doubles due to alignment problems.

>     }
> 
>     void*
>     StringImpl::operator new( size_t s , int length )
>     {
>         assert( s == sizeof( StringImpl ) ) ;
>         return ::operator new( s + length ) ;
>     }
>     void
>     StringImpl::operator delete( void* p )
>     {
>         ::operator delete( p ) ;
>     }
> 
>     StringImpl::StringImpl( int length )
>         :   theLength( length )
>     {
>     }
> 
> The purpose of the private constructor (and the static mkStringImpl)
> is to ensure that 1) all elements *are* created on the heap, and 2)
> the extra parameter to new and the parameter to the constructor are
> identical.  (Safety, in sum.)
> 
> In my tests, all of the functions were made inline.

You're right, it's ugly.  It's also no safer than something using explicit
casting.  For some strange reason :-), it _is_ efficient (at least this part,
w/o reference counting), since all of this stuff can be inlined.  The mere fact
that compiler writers have been so accommodating in this kind of thing is proof
positive that people have to do this kind of end-run around the type system
a lot.  (Or at least people at ATT have to do this a lot!)

But I think that most people would agree that you've subverted the C++ type
system to achieve your end, so 'legal' here would mean a language lawyer's
'legal', rather than 'ethical'.

-- 
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug10145551@slsvhdt.lts.sel.alcatel.de>
In article <··········@news.parc.xerox.com> ·····@parc.xerox.com (Hans
Boehm) writes:

    [Concerning the quality, or lack thereof, of standard string
classes...] 
|> I don't think you should use string manipulation where it's inappropriate.
|> But many programs (editors, mailers, news readers, etc.) naturally
|> manipulate long strings.  Currently they mostly invent their own custom
|> representations, except where the authors decided not to bother.  In those
|> places they effectively break on unexpectedly long input.  Why not use
|> a robust representation as the standard, and then optimize in those
|> few places where it doesn't perform adequately, and it's safe to
|> optimize?

I've considered this a bit, too.  I know of several ways of
implementing a string class that are significantly faster than the
classical solution for specific operations.

The problem is, that the operations which need to be optimized are not
the same for different applications.  A compiler will need different
string handling than an editor, for example.

With this in mind, I'm not really convinced that the solution is to
try and create an optimal string class as a standard.  I rather think
of the standard string class as a facility for people like myself,
whose programs only do string handling secondarily (formatting error
messages, and the like).  If I were writing an editor, for example, I
would not expect the standard string class to be acceptable for my
text buffers.  In this regard, just about any string class with the
required functionality will do the trick.  (And it is more important
that the string class be easier to use than that it be fast.)

This does mean that most text oriented applications will have to
`re-invent the wheel', in that they will have to write their own
string class.  But I'm not convinced that there is a string class
which would be appropriate for all applications, anyway.
-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug16203305@slsvhdt.lts.sel.alcatel.de>
In article <··········@news.parc.xerox.com> ·····@parc.xerox.com (Hans
Boehm) writes:

|> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) writes:

|> ...

|> >With this in mind, I'm not really convinced that the solution is to
|> >try and create an optimal string class as a standard.  I rather think
|> >of the standard string class as a facility for people like myself,
|> >whose programs only do string handling secondarily (formatting error
|> >messages, and the like).  If I were writing an editor, for example, I
|> >would not expect the standard string class to be acceptable for my
|> >text buffers.  In this regard, just about any string class with the
|> >required functionality will do the trick.  (And it is more important
|> >that the string class be easier to use than that it be fast.)

|> >This does mean that most text oriented applications will have to
|> >`re-invent the wheel', in that they will have to write their own
|> >string class.  But I'm not convinced that there is a string class
|> >which would be appropriate for all applications, anyway.

|> I agree that some applications will need their own string class.
|> Nontheless, I think a standard string class is important.  Many libraries
|> will require strings as input parameters, or generate strings.  If you
|> want any hope of having such libraries play together, they will need
|> a standard string representation.

I agree that a standard string representation is important.  There are
a lot of programs (including most of my own) where text has simply a
support function; for these, it would be ridiculous to have to create
a special class to support it.

|> I claim the characteristics you want from a standard string class are:

|> 1) It should be easy to use, with a clean interface.  It should be
|> general enough to allow reusability of libraries that rely on it.

I agree.  This should be the single most important feature of a
general string class.

|> 2) It should be robust.  The implementation should scale reasonably
|> to large inputs, both in that it remains correct, and that its performance
|> remains reasonable.

I'm not totally convinced that support for very long strings is
necessary.  I do agree that it should be robust in the sense that 1)
it is itself free of errors, and 2) it detects and handles user errors
gracefully (for some definition of gracefully).

I also think that, in the context of C++, there is a third criterion
which is very important:

3) it must ``work'' with the old, '\0' terminated C style strings.
After all, this is what, for example, most window systems will expect
for a considerable time.

|> I would argue that a string class based on a conventional flat representation,
|> and whose interface exposes that, doesn't do very well on either criterion.

I agree.  At the very least, it fails the first criterion.  But there
is a fundamental conflict between this and the third criterion (which
obviously requires a means of getting a flat representation).

This said, I think that the current string class does allow internal
representations other than a flat one, on the condition of converting
itself to the flat representation whenever c_str is called.

|> It's often difficult or impossible to pass some other string representation
|> (e.g. a file) to a string function without actually changing representation
|> (e.g. reading the whole file).  A tree-based representation can, at minimal
|> cost, accommodate strings with user-specified character access functions.

I don't see any possibility of passing a file to a ``standard'' string
class without reading it anyway.  Remember, you do not have access to
the internals of a standard string class to let it know about
alternative representations, like a file.  So if it isn't provided for
in the standard, you don't have it.

|> A flat representation does allow in-place string updates,
|> a facility I would personally rather leave out of a "standard" string class.
|> You still have arrays of characters when you need them.

I agree here.  My own string class did not support in-place updates,
and I've never really seen the need for them.  But a ``standard''
string class must represent a consensus, and most people seem to want
in-place updates.  (The only non-const function in my string class was
operator=.)

   [I've cut the text concerning scaling.  As I said earlier, I'm not
convinced that the general string class has to support long strings
that efficiently...]
-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug17152014@slsvhdt.lts.sel.alcatel.de>
In article <·······················@192.0.2.1> ······@netcom.com
(Henry Baker) writes:

|> In article <···················@slsvhdt.lts.sel.alcatel.de>,
|> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

|> > In article <·······················@192.0.2.1> ······@netcom.com
|> > (Henry Baker) writes:
|> > 
|> > |> In article <··················@slsvhdt.lts.sel.alcatel.de>,
|> > |> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:
|> > 
|> > |> > Interestingly enough, there is a legal (and safe) way of implementing
|> > |> > this same idiom in C++.  It's so ugly, though, that I'm not going to
|> > |> > post it.  (I know, if you were worried about ugliness, you wouldn't be
|> > |> > using C++ anyway:-).)
|> > 
|> > |> Go ahead & post it!  Hit us with your ugly stick!  We're adults, and
|> > |> we've turned off the V-chip...
|> > 
|> > Just the essentials (and ignoring eventual reference counting):

Note the name of the class.  This is an implementation class, only to
be used by an interface class (string) that the client sees.  (Also,
its only purpose is to show that the ``ugly struct hack'' can actually
be written in C++.)

|> >     class StringImpl
|> >     {
|> >     public :
|> >         static StringImpl*  mkStringImpl( int length ) ;
|> >         char*               buffer() ;

|> // Gain access to the actual string buffer.

|> >         void*               operator new( size_t s , int length ) ;

|> // Overlard misplacement 'new'.

|> >         void                operator delete( void* ) ;
|> >     private :
|> >                             StringImpl( int length ) ;

|> // private ctor means no static and/or stack instances; being 'private' means
|> // that derived classes can't create any, either.  No default ctor means no
|> // arrays or members.  No friends declared.  _Do_ have default assignment,
|> // so it isn't as safe as you might think.

Correct.  I should have declared a private copy constructor and
assignment operator.

|> >         int                 theLength ;
|> >     } ;
|> > 
|> >     StringImpl*
|> >     StringImpl::mkStringImpl( int length )
|> >     {
|> >         return new( length ) StringImpl( length ) ;
|> >     }
|> > 
|> >     char*
|> >     StringImpl::buffer()
|> >     {
|> >         return (char*)( this + 1 ) ;

|> // This may not work so well for arrays of doubles due to alignment problems.

Quite correct.  In practice, I don't know of an architecture on which
it would fail, but this is actually just a coincidence: the size of
the two int's in the data corresponds to (a multiple of) the size of a
double.

|> >     }
|> > 
|> >     void*
|> >     StringImpl::operator new( size_t s , int length )
|> >     {
|> >         assert( s == sizeof( StringImpl ) ) ;
|> >         return ::operator new( s + length ) ;
|> >     }
|> >     void
|> >     StringImpl::operator delete( void* p )
|> >     {
|> >         ::operator delete( p ) ;
|> >     }
|> > 
|> >     StringImpl::StringImpl( int length )
|> >         :   theLength( length )
|> >     {
|> >     }
|> > 
|> > The purpose of the private constructor (and the static mkStringImpl)
|> > is to ensure that 1) all elements *are* created on the heap, and 2)
|> > the extra parameter to new and the parameter to the constructor are
|> > identical.  (Safety, in sum.)
|> > 
|> > In my tests, all of the functions were made inline.

|> You're right, it's ugly.  It's also no safer than something using explicit
|> casting.

Yes and no.  The only advantage I can see in it is that it
encapsulates the ugliness in one small class, rather than spreading it
out throughout the application.  On the other hand, I don't use it in
my string class.

|> For some strange reason :-), it _is_ efficient (at least this part,
|> w/o reference counting), since all of this stuff can be inlined.  The mere fact
|> that compiler writers have been so accomodating in this kind of thing is proof
|> positive that people have to do this kind of end-run around the type system
|> a lot.  (Or at least people at ATT have to do this a lot!)

|> But I think that most people would agree that you've subverted the C++ type
|> system to achieve your end, so 'legal' here would mean a language lawyer's
|> 'legal', rather than 'ethical'.

I agree totally here, which is why I didn't post it to begin with.  I
don't use it; I thought it up more as a response to a challenge, than
as something I would actually want to use.  I certainly don't
recommend programming this way.
-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung
From: James Kanze US/ESC 60/3/141 #40763
Subject: Re: allocator and GC locality (was Re: cost of malloc)
Date: 
Message-ID: <KANZE.95Aug17153033@slsvhdt.lts.sel.alcatel.de>
In article <··········@mv.mv.com> Sam Kendall <kendall> writes:

|> ·····@lts.sel.alcatel.de (James Kanze US/ESC 60/3/141 #40763) wrote:

|> >Accessing contents with an index > 0 is undefined behavior;
|> >with any implementation of reasonable quality, it will cause a bounds
|> >check error.  (Of course, I've never seen an implementation of C with
|> >reasonable quality, so you may be safe.  Although I seem to have heard
|> >somewhere that there is one, Centerline, maybe?)

|> That's right; CenterLine has a C/C++ interpreter that checks all sorts of
|> things, including array and pointer bounds.  There are other products,
|> notably Purify and similar things, that catch some but not all bounds
|> violations in these awful, unsafe languages.

|> My fantasy: what if 10% of the effort spent in developing these C/C++
|> error-checking tools (I've written major chunks of several of them) were spent
|> switching programmers over to a good, safe language like Modula-3 or Dylan?

Commercially, it doesn't pay.  You sell them a good Modula-3 compiler,
maybe a little training, and they go off and write applications that
work, and you never see them again.  With C++, on the other hand, you
can be sure that they will keep coming back for more tools to try and
manage the complexity and the risk.  Which, of course, you will gladly
sell them.
-- 
James Kanze         Tel.: (+33) 88 14 49 00        email: ·····@gabi-soft.fr
GABI Software, Sarl., 8 rue des Francs-Bourgeois, F-67000 Strasbourg, France
Conseils en informatique industrielle --
                              -- Beratung in industrieller Datenverarbeitung