From: David Betz
Subject: hash tables with a copying memory gc
Date: 
Message-ID: <Tg3Z3.462$MQ.11584@wbnws01.ne.mediaone.net>
I'm sure this must be obvious but I'm having trouble with it anyway. I have
built a system based on a copying garbage collector and I would like to add
hash table support. In a system where objects never move an obvious choice
for a hash key is the address of the object. Of course, this doesn't work
with a copying collector since objects can potentially move at every
collection. How are hash tables usually implemented in a copying
environment?

One possibility I can think of is to give each object an id that is never
recycled even if the object is collected and use the id as a hash key. This
works but wastes space in every object.

Another possibility is to rehash every time a collection occurs. This wastes
processing time.
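
For concreteness, the first possibility might look roughly like this
(a sketch; the names are made up):

    ;; every object carries a serial number assigned at allocation and
    ;; never recycled; the number, not the address, is the hash key
    (defvar *next-object-id* 0)

    (defstruct obj
      (id (incf *next-object-id*))   ; costs a slot in every object
      datum)

    (defun obj-eq-hash (object table-size)
      ;; stable across collections, because the id never changes
      (mod (obj-id object) table-size))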

What is the good solution that I'm overlooking?

Thanks,
David Betz

From: Erik Naggum
Subject: Re: hash tables with a copying memory gc
Date: 
Message-ID: <3151988077875646@naggum.no>
* "David Betz" <·····@mediaone.net>
| I'm sure this must be obvious but I'm having trouble with it anyway. I
| have built a system based on a copying garbage collector and I would like
| to add hash table support. In a system where objects never move an
| obvious choice for a hash key is the address of the object. Of course,
| this doesn't work with a copying collector since objects can potentially
| move at every collection. How are hash tables usually implemented in a
| copying environment?

  this applies to an EQ hash table as EQUAL and beyond cannot use the
  address of the object for the obvious reasons.  so for some objects that
  are likely to be stored in an EQ hash table, the typical example being a
  symbol, you store a hash key with the object, but not in all objects.
  for the rest, you have to change the key in the hashtable as you move the
  objects.   this may not be a huge cost, but if you have a generational
  garbage collector, the hashtable itself will at least become old enough
  not to move around all the time, and most of its keys probably will, too.
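
  a sketch of the symbol case (made-up names; the point is only that
  the stored hash is computed once from the name and never depends on
  the address):

    (defstruct (sym (:constructor make-sym (name)))
      name
      (cached-hash nil))

    (defun sym-hash (s)
      ;; computed from the print name and cached in the symbol; SXHASH
      ;; of a string depends on its contents, not its address, so
      ;; copying the symbol around is harmless
      (or (sym-cached-hash s)
          (setf (sym-cached-hash s) (sxhash (sym-name s)))))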

| What is the good solution that I'm overlooking?

  you _could_ move the keys outside the standard garbage collector's reach
  -- it's not like they're going away unless the hashtable releases them,
  and you pretty much have full control over that.  during the first GC
  after a key has been inserted in or deleted from the table, you would
  take care of this and would then not have to traverse the keys at all
  otherwise.  if allocating memory that doesn't move around is bad for
  other reasons, the solutions I can think of only move the problems out
  of hash tables and into other similar constructs, such as using indices
  into arrays instead of addresses, which doesn't really help any.  perhaps
  memory that moves around less often (like only when it's full) is a good
  compromise, in which case we're talking about a sort of generational
  garbage collector if not the whole nine yards.

#:Erik
-- 
  Attention Microsoft Shoppers!  MS Monopoly Money 6.0 are now worthless.
From: David Betz
Subject: Re: hash tables with a copying memory gc
Date: 
Message-ID: <wTdZ3.541$MQ.25828@wbnws01.ne.mediaone.net>
My thanks to both Erik and Rob for their comments. They were very helpful.
From: Rob Warnock
Subject: Re: hash tables with a copying memory gc
Date: 
Message-ID: <813ama$24o6f@fido.engr.sgi.com>
David Betz <·····@mediaone.net> wrote:
+---------------
| I have built a system based on a copying garbage collector and I would
| like to add hash table support...
+---------------

Me too. (Well, "working on", not "have built"...) The following somewhat
half-baked approach is the best I've come up with so far. It *ought* to
avoid re-hashing for many common uses of hash tables. (I'd be interested
to see how it compares with commercial implementations.)

+---------------
| How are hash tables usually implemented in a copying environment?
| ...give each object an id that is never recycled...  This works but
| wastes space... [or] ...address of the object [and] rehash every time
| a collection occurs. This wastes processing time.
| What is the good solution that I'm overlooking?
+---------------

Those are the two main ways I know of. Well, there's a third choice, which
works only for "persistent" hashes [see below], and that's to recompute
the hash every time you do a lookup, but that wastes time if keys tend
to be looked up lots of times, on average. The direction I've been heading
is a mixed-strategy hybrid of all of the above, with two main elements:

First, partition the object/type/class space into objects that have
more-or-less obvious persistent (or idempotent or immutable) hashes
(e.g., strings, symbols, & numbers come to mind) and those that don't
(conses, general vectors, etc.). For pragmatic or heuristic reasons
you might decide to shift *some* (but not necessarily all) objects that
"naturally" fall in the latter group into the first group by providing
them with unique IDs -- perhaps "heavyweight" class or structure instances,
where the extra space for a unique ID might not be as big an issue.

Then, upon garbage collection, only rehash the table if at least one
key that *doesn't* have a persistent hash is currently in the table;
otherwise not. That is:

1. When you create an empty hash table or "clrhash" it, flag it as "pure"
   [or whatever you want to call it]:  (setf (hash-table-pure-p ht) t)

2. Let each (setf (gethash key ht) val) implicitly also do something like:
   (setf (hash-table-pure-p ht) (and (hash-table-pure-p ht)
				     (hash-key-persistent-p key)))
   where the "hash-key-persistent-p" predicate is true if it is possible
   (*and* if you've chosen in your design) to calculate a persistent-style
   hash for that object -- one that is unchanged across collections.

3. Each time a hash table is copied by the GC, if (hash-table-pure-p ht)
   then you don't have to rehash the table, else you do.

4. Whenever the GC rehashes a hash table, it must behave AS IF a "clrhash"
   had been done followed by re-inserting all of the entries, allowing an
   "impure" hash table to become "pure" again if all the impure keys are
   remhash'd. [Note that since hash tables always have an integer number
   of keys in them, you could achieve the "as if" cheaply with a simple
   "ref count" of impure keys maintained by setf/gethash & remhash.]

5. For those objects whose hash values can be persistent, you have a number
   of choices of when to compute the hash:
   A. Always compute it when the object is initially created, and store it
      in the object for later use if the object is ever used as a key of any
      hash table. (Instantiation serial numbers fall here, too.) Costs space
      in every object of every persistent-hash-capable type.
   B. When the object is initially created, store an "impossible" hash in
      the object (also costs space per object), but don't actually compute
      the hash until the first time the object is used as a key in *some*
      hash table (*then* store it in the object). Since most objects are
      never used as hash table keys, this might save considerable CPU time.
      [On the other hand, "intern" probably wants to use the same mechanism,
      so you might as well compute & store the hash for every symbol's
      "symbol-name" at read time, as in "A".]
   C. You might *never* store the hash (saving space), but recompute it
      whenever you need it. [Though obviously, this *can't* work for objects
      using instantiation serial numbers!]
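
Here's a rough sketch of steps 1-4 in Lisp (the names are made up, the
buckets are simulated with a built-in EQL table, and REHASH stands in
for the GC's real re-insertion work):

    (defstruct ht
      (table (make-hash-table :test #'eql))  ; stand-in for the buckets
      (impure-count 0))                      ; "ref count" of impure keys

    (defun hash-key-persistent-p (key)
      ;; true for objects whose hash survives a copying collection
      (typep key '(or symbol number character)))

    (defun ht-pure-p (h)                     ; step 1's flag, derived
      (zerop (ht-impure-count h)))

    (defun ht-clr (h)                        ; step 1: an empty table is pure
      (clrhash (ht-table h))
      (setf (ht-impure-count h) 0))

    (defun ht-put (h key val)                ; step 2, via the ref count
      (unless (or (nth-value 1 (gethash key (ht-table h)))
                  (hash-key-persistent-p key))
        (incf (ht-impure-count h)))
      (setf (gethash key (ht-table h)) val))

    (defun ht-rem (h key)                    ; step 4: tables can become pure
      (when (and (nth-value 1 (gethash key (ht-table h)))
                 (not (hash-key-persistent-p key)))
        (decf (ht-impure-count h)))
      (remhash key (ht-table h)))

    (defun rehash (h)
      ;; stand-in: rebuild the buckets so each key lands in the slot
      ;; computed from its current (post-copy) address
      (let ((new (make-hash-table :test #'eql)))
        (maphash (lambda (k v) (setf (gethash k new) v)) (ht-table h))
        (setf (ht-table h) new)))

    (defun gc-copy-ht (h)                    ; step 3
      (unless (ht-pure-p h)
        (rehash h)))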

The thing that's interesting to me is that you don't have to apply
a uniform strategy to all objects/types/classes, but can tune the
implementation to give a hybrid strategy that's better than any single
strategy. [You might even expose some of the tuning knobs to the user,
though that might get a bit tricky in places...]


-Rob

p.s. One more point: Whatever you use for a hash function -- *especially*
for objects with instantiation-serial-numbers/IDs -- should have the property
that the information is fairly uniformly distributed throughout the bits
of the largest possible hash value. That allows you to grow a hash table
*without* recomputing the hash values -- just use more bits of the ones that
have already been computed. Raj Jain wrote a paper at DEC that showed that
CRCs (particularly the Ethernet CRC-32) have this property, and for many
years I've been using a table-lookup version of CRC-32 almost everywhere
I needed a hash function. (On some machines it's even cheaper than a
multiply-and-add hash.)
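
Here's a sketch of that kind of table-lookup CRC-32 (the standard
reflected CRC-32 polynomial; power-of-two table sizes are assumed for
the masking trick):

    (defvar *crc32-table*
      (let ((table (make-array 256 :element-type '(unsigned-byte 32))))
        (dotimes (i 256 table)
          (let ((crc i))
            (dotimes (j 8)
              (setf crc (if (oddp crc)
                            (logxor #xEDB88320 (ash crc -1))
                            (ash crc -1))))
            (setf (aref table i) crc)))))

    (defun crc32-hash (string)
      ;; per character: two XORs, a shift, and one table lookup
      (let ((crc #xFFFFFFFF))
        (loop for ch across string
              do (setf crc (logxor (ash crc -8)
                                   (aref *crc32-table*
                                         (logand #xFF (logxor crc (char-code ch)))))))
        (logxor crc #xFFFFFFFF)))

    (defun bucket-index (hash table-size)
      ;; growing the table never touches stored hashes: a bigger
      ;; power-of-two table just masks in more of the same bits
      (logand hash (1- table-size)))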

-----
Rob Warnock, 8L-846		····@sgi.com
Applied Networking		http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673
1600 Amphitheatre Pkwy.		FAX: 650-933-0511
Mountain View, CA  94043	PP-ASEL-IA
From: Duane Rettig
Subject: Re: hash tables with a copying memory gc
Date: 
Message-ID: <4wvre2wai.fsf@beta.franz.com>
····@rigden.engr.sgi.com (Rob Warnock) writes:

> David Betz <·····@mediaone.net> wrote:
> +---------------
> | I have built a system based on a copying garbage collector and I would
> | like to add hash table support...
> +---------------
> 
> Me too. (Well, "working on", not "have built"...) The following somewhat
> half-baked approach is the best I've come up with so far. It *ought* to
> avoid re-hashing for many common uses of hash tables. (I'd be interested
> to see how it compares with commercial implementations.)

OK, you asked for it :-)

Your implementation seems similar enough to that of Allegro CL that
I will describe differences intermixed with your descriptions.

Please note that this whole discussion is only applicable to EQ and
EQL hash tables.  EQUAL and EQUALP hash tables have a different set
of requirements and are thus not affected by the issues of
address-based hashing.

> +---------------
> | How are hash tables usually implemented in a copying environment?
> | ...give each object an id that is never recycled...  This works but
> | wastes space... [or] ...address of the object [and] rehash every time
> | a collection occurs. This wastes processing time.
> | What is the good solution that I'm overlooking?
> +---------------
> 
> Those are the two main ways I know of. Well, there's a third choice, which
> works only for "persistent" hashes [see below], and that's to recompute
> the hash every time you do a lookup, but that wastes time if keys tend
> to be looked up lots of times, on average. The direction I've been heading
> is a mixed-strategy hybrid of all of the above, with two main elements:
> 
> First, partition the object/type/class space into objects that have
> more-or-less obvious persistent (or idempotent or immutable) hashes
> (e.g., strings, symbols, & numbers come to mind) and those that don't
> (conses, general vectors, etc.). For pragmatic or heuristic reasons
> you might decide to shift *some* (but not necessarily all) objects that
> "naturally" fall in the latter group into the first group by providing
> them with unique IDs -- perhaps "heavyweight" class or structure instances,
> where the extra space for a unique ID might not be as big an issue.

This seems very similar to what Allegro CL does as an intermediate
strategy (there is a preliminary strategy and also a strategy for
objects that don't fit into the persistent hash category, each of
which I will describe below).  Those objects with "persistent hash"
are NIL, symbols, characters, all numbers, function objects, and
standard-instances (CLOS classes and objects).  We explicitly exclude
strings, because a string's hash code is calculated from its contents, and
doing a (setf (schar str i) ch) might cause the hash code to change,
thus invalidating the persistence of the hash.

These objects never "dirty" a hash table.  You can put as many of
these objects into an EQ or EQL hash table as you want and never
cause it to rehash due to movement (though a rehash due to size
might occur).

> Then, upon garbage collection, only rehash the table if at least one
> key that *doesn't* have a persistent hash is currently in the table;
> otherwise not. That is:

This statement and others below are leading me to conclude that you
are having the gc do the rehashing, or perhaps forcing a rehash at
the end of each gc.  Just to be sure, let me emphasize that it is
not necessary to rehash a hash-table every time a gc occurs; it is
more efficient to do the rehashing lazily, that is, before the first
access after the first gc after the hash-table is dirtied.  Thus,
if a hash table is seldom used, it might go through many hundreds
of gcs before it is actually rehashed.

To this end, here is the promised preliminary strategy: each gc is
numbered, and each hash-table stores a "hash tick" which corresponds
to the number of the gc in which it was last rehashed.  Any access
will do a rehash-dirty check; if the hash tick is the same as the gc
number, there is no reason to rehash the table, because a gc hasn't
occurred yet.  But if the hash tick and the gc number disagree, then
the hash table must be checked for dirtiness; if it is dirty the
rehash occurs and the hash tick in the table is updated to the number
of the current gc.  [A similar and simpler strategy could be arranged
by just having the gc set a bit, but having the hash-tick allows
us to perform usage analyses].
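
In skeletal form (a sketch with made-up names; REHASH-NOW stands in
for the actual re-insertion of every entry under its new address):

    (defvar *gc-number* 0)            ; incremented by every gc

    (defstruct lht
      (hash-tick 0)                   ; gc number at the last rehash
      (dirty nil))                    ; set by the gc when keys move

    (defun rehash-now (h)
      ;; placeholder for rebuilding the buckets from the new addresses
      (declare (ignore h)))

    (defun rehash-dirty-check (h)
      ;; done at the top of every access, before the lookup
      (unless (= (lht-hash-tick h) *gc-number*)  ; a gc has happened...
        (when (lht-dirty h)                      ; ...and keys have moved
          (rehash-now h)
          (setf (lht-dirty h) nil
                (lht-hash-tick h) *gc-number*))))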

> 1. When you create an empty hash table or "clrhash" it, flag it as "pure"
>    [or whatever you want to call it]:  (setf (hash-table-pure-p ht) t)

A rehash should also start by setting this "pure-p" boolean in the table.
If a hash table only had one key that made it "impure" ("dirty", in
Allegro CL parlance) and that key were remhashed at some point, the hash
table should be able to become "pure" (or "clean", in Allegro CL) again.

> 2. Let each (setf (gethash key ht) val) implicitly also do something like:
>    (setf (hash-table-pure-p ht) (and (hash-table-pure-p ht)
> 				     (hash-key-persistent-p key)))
>    where the "hash-key-persistent-p" predicate is true if it is possible
>    (*and* if you've chosen in your design) to calculate a persistent-style
>    hash for that object -- one that is unchanged across collections.
> 
> 3. Each time a hash table is copied by the GC, if (hash-table-pure-p ht)
>    then you don't have to rehash the table, else you do.
> 
> 4. Whenever the GC rehashes a hash table, it must behave AS IF a "clrhash"
>    had been done followed by re-inserting all of the entries, allowing an
>    "impure" hash table to become "pure" again if all the impure keys are
>    remhash'd. [Note that since hash tables always have an integer number
>    of keys in them, you could achieve the "as if" cheaply with a simple
>    "ref count" of impure keys maintained by setf/gethash & remhash.]

The above works well for objects of persistent hash, but what can one do
for all of those objects that cannot be persistently hashed?  It seems
a shame to dirty a whole table just because one object might move.

Allegro CL treats an object with non-persistent hash by noting that the
hashes _are_ persistent until the object moves, and thus the object
should only dirty the table if it actually might move.  This approach is
Allegro CL specific, but can probably be applied to any generational gc.

Background:
 - Allegro CL has a notion of "new" and "old" (or "tenured") spaces.
 - Objects in newspace are moved each scavenge (normal generational gc),
   and objects in oldspace are only possibly moved when a global gc
   occurs.
 - The test for newness of an object is very simple; it is a comparison
   to a "fence" that separates newspace and oldspace.

Using these facts, we implement a multi-stage dirtying algorithm.  A hash
table has two booleans instead of one: a rehash-after-scavenge bit, and
a rehash-after-global-gc bit.  The hash-table is clean if both of these
bits are clear.  If a non-persistent-hash object is currently in newspace,
a (setf gethash) of that object as a key will cause the
rehash-after-scavenge bit to be forced, and if the object is in oldspace,
the rehash-after-global-gc bit is forced.

So the gc (either scavenge or global-gc) works with the rehash-dirty check
to determine whether the hash-table must really be rehashed; a table that
is marked dirty only for a global-gc need not be rehashed after a mere
scavenge.

What does this imply?  If you fill up a hash table and then cause all of
the objects that are used as keys to be tenured (made old), and if you
arrange to never cause a global-gc, then your hash-table will never need
rehashing, even if it has non-persistent-hash objects as keys.
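
In outline (a sketch with made-up names; small integers stand in for
heap addresses, and tenuring of keys during a scavenge is ignored,
though a real implementation has to track it):

    (defvar *fence* 1000)   ; addresses at or above the fence are newspace

    (defstruct ght
      (rehash-after-scavenge nil)
      (rehash-after-global-gc nil))

    (defun newspace-p (address)
      ;; the newness test is a single comparison against the fence
      (>= address *fence*))

    (defun note-impure-key (h key-address)
      ;; called from (setf gethash) for a non-persistent-hash key
      (if (newspace-p key-address)
          (setf (ght-rehash-after-scavenge h) t)    ; may move at any scavenge
          (setf (ght-rehash-after-global-gc h) t))) ; moves only at global gc

    (defun rehash-needed-after-scavenge-p (h)
      (ght-rehash-after-scavenge h))

    (defun rehash-needed-after-global-gc-p (h)
      ;; a global gc can move objects in both spaces
      (or (ght-rehash-after-scavenge h)
          (ght-rehash-after-global-gc h)))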

> 5. For those objects whose hash values can be persistent, you have a number
>    of choices of when to compute the hash:
>    A. Always compute it when the object is initially created, and store it
>       in the object for later use if the object is ever used as a key of any
>       hash table. (Instantiation serial numbers fall here, too.) Costs space
>       in every object of every persistent-hash-capable type.

We do this for symbols, function objects, and standard instances.  Symbols'
hash codes are generated by way of their names.  The others each have a
simple counter that gets incremented.

>    B. When the object is initially created, store an "impossible" hash in
>       the object (also costs space per object), but don't actually compute
>       the hash until the first time the object is used as a key in *some*
>       hash table (*then* store it in the object). Since most objects are
>       never used as hash table keys, this might save considerable CPU time.
>       [On the other hand, "intern" probably wants to use the same mechanism,
>       so you might as well compute & store the hash for every symbol's
>       "symbol-name" at read time, as in "A".]

We don't use this technique.  Most symbols are interned, automatically causing
a need for a hash code to be generated.  And our simple incremental hash-codes
for function objects and standard-instances make it no big deal to just
stuff the new code in at object-creation time.

>    C. You might *never* store the hash (saving space), but recompute it
>       whenever you need it. [Though obviously, this *can't* work for objects
>       using instantiation serial numbers!]

This is always done for objects like numbers.

> The thing that's interesting to me is that you don't have to apply
> a uniform strategy to all objects/types/classes, but can tune the
> implementation to give a hybrid strategy that's better than any single
> strategy. [You might even expose some of the tuning knobs to the user,
> though that might get a bit tricky in places...]

Well, we do allow user defined hash-code generation and tests, although
such user tweaking does take the hash-table out of the realm of the EQ
or EQL hash table strategy.

> p.s. One more point: Whatever you use for a hash function -- *especially*
> for objects with instantiation-serial-numbers/IDs -- should have the property
> that the information is fairly uniformly distributed throughout the bits
> of the largest possible hash value. That allows you to grow a hash table
> *without* recomputing the hash values -- just use more bits of the ones that
> have already been computed. Raj Jain wrote a paper at DEC that showed that
> CRCs (particularly the Ethernet CRC-32) have this property, and for many
> years I've been using a table-lookup version of CRC-32 almost everywhere
> I needed a hash function. (On some machines it's even cheaper than a
> multiply-and-add hash.)

I'm not sure I understand the point of this paragraph; if a persistent hash
object is storing its hash code, then "computing" the hash is nothing more
than grabbing it out of the object.  If it is a generated hash code, then
you must recompute it anyway, before hashing it in whatever way you do
(either by truncation or modulo arithmetic) to fit into the size of the
hash table.  What am I missing here?

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Pekka P. Pirinen
Subject: Re: hash tables with a copying memory gc
Date: 
Message-ID: <ixd7t1kwi1.fsf@gaspode.cam.harlequin.co.uk>
Duane Rettig <·····@franz.com> writes:
> ····@rigden.engr.sgi.com (Rob Warnock) writes:
> > First, partition the object/type/class space into objects that have
> > more-or-less obvious persistent (or idempotent or immutable) hashes
> > (e.g., strings, symbols, & numbers come to mind) and those that don't
> [...]
> it is not necessary to rehash a hash-table every time a gc occurs; it is
> more efficient to do the rehashing lazily, that is, before the first
> access after the first gc after the hash-table is dirtied.
> [...]
> here is the promised preliminary strategy: each gc is numbered, and
> each hash-table stores a "hash tick" which corresponds to the number of
> the gc in which it was last rehashed.  Any access will do a rehash-
> dirty check; if the hash tick is the same as the gc number, there is no
> reason to rehash the table, because a gc hasn't occurred yet.

Here's another technique to avoid even more rehashes: do the access
_first_, and do the rehashing check only if the access fails to find
an entry.  Then, if the table needed to be rehashed, the access must
be redone, since the entry might be there after all, only it was in
the slot computed from its old address.  The cost of redoing the
access is small compared to the rehashing cost.

This means the table doesn't get rehashed if you're only accessing
existing elements with persistent hashes.  Additionally, if the hash
table has a high hit rate, you're paying very little for the rehashing
checks.
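
In outline (a sketch with made-up names; RAW-GETHASH stands for the
probe against the current buckets, and the gc sets the dirty flag
when keys move):

    (defstruct aht
      (buckets (make-hash-table :test #'eql))
      (needs-rehash nil))                ; set by the gc when keys move

    (defun raw-gethash (h key)
      (gethash key (aht-buckets h)))

    (defun rehash-now (h)
      ;; placeholder for re-inserting every entry under its new address
      (setf (aht-needs-rehash h) nil))

    (defun lookup (h key)
      ;; do the access first; check the dirty flag only on a miss
      (multiple-value-bind (val present) (raw-gethash h key)
        (cond (present (values val t))   ; a hit is a hit: no rehash check
              ((aht-needs-rehash h)      ; a miss: the entry might have moved
               (rehash-now h)
               (raw-gethash h key))      ; redo the access once
              (t (values nil nil)))))    ; a genuine miss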

> Allegro CL treats an object with non-persistent hash by noting that the
> hashes _are_ persistent until the object moves, and thus the object
> should only dirty the table if it actually might move. [details
> omitted] A hash table has two booleans instead of one: a
> rehash-after-scavenge bit, and a rehash-after-global-gc bit.  The
> hash-table is clean if both of these bits are clear.

Harlequin Dylan (now called Functional Developer,
<URL:http://www.functionalobjects.com/>) had a similar scheme in its
hash tables, but with more bits, roughly corresponding to generations.
It's important to use divisions that are cheap to compute during
GETHASH and SETF GETHASH.

> What does this imply?  If you fill up a hash table and then cause all of
> the objects that are used as keys to be tenured (made old), and if you
> arrange to never cause a global-gc, then your hash-table will never need
> rehashing, even if it has non-persistent-hash objects as keys.

Note that the access-first technique can have much the same effect on
old, unchanging hash tables: since the addresses never (rarely)
change, the accesses (almost) always succeed, as long as the program is
looking up something that actually is in the table.
-- 
Pekka P. Pirinen
Adaptive Memory Management Group, Harlequin Limited
If you don't succeed at first, try again.  Then quit.  No use of being a
damn fool about it.  - W. C. Fields