From: Jim Newton
Subject: efficiently generating hash keys
Date: 
Message-ID: <2vndemF2o6lj6U1@uni-berlin.de>
I am maintaining several hash tables for which the keys are
lists of small integers.  I fear that the hash tables might
grow to a size of about 10000, but i'm not quite sure yet.

A typical key looks like this.
  (5 0 4 0 2 2 1 1 6 4 1 2 4 1 5 10 6 10 7 9 1 2 1 9 1 2 5 8)

I know for each of the positions what the maximum value is.
For some positions the maximum value is 11, for other positions
the only possible values are 1 or 2.

My question is should i use the list itself as a hash key or
should I generate an integer by sort of a power series

( dM + nM*(... n5*(d4 + n4*(d3 + n3*(d2 + (n2*d1)))

where nM ... n2 represent the maximum values for each
position, and dM ... d1 represent the values in the
particular key.  I could even make the equation a bit simpler
by taking the maximum n which is 11 in my example.

( dM + ... N*(d4 + N*(d3 + N*(d2 + (N*d1)))

This equation is somewhat similar to representing numbers
in a base system.  base 10 for example.  the number 2354
is simply 4 + 10*(5 + 10*(3 + 10*2))

This key would be expensive to calculate but then hash lookups
could be faster integer lookups.  And perhaps the compiler
would be able to optimize such a combination of
additions and multiplications very well.

The question is how do i decide whether to make the key
generation expensive and the hash lookup easy, or the
key generation easy and the hash lookup difficult?

If someone has some advise or rules of thumb, I'd be
very happy to hear what you think.

From: Rahul Jain
Subject: Re: efficiently generating hash keys
Date: 
Message-ID: <87sm7dpbwb.fsf@nyct.net>
Jim Newton <·····@rdrop.com> writes:

> I am maintaining several hash tables for which the keys are
> lists of small integers.  I fear that the hash tables might
> grow to a size of about 10000, but i'm not quite sure yet.
>
> A typical key looks like this.
>   (5 0 4 0 2 2 1 1 6 4 1 2 4 1 5 10 6 10 7 9 1 2 1 9 1 2 5 8)
[...]
> The question is how do i decide whether to make the key
> generation expensive and the hash lookup easy, or the
> key generation easy and the hash lookup difficult?

Calculating the EQUAL hash-code for this list is going to be similarly
expensive, as is comparing various keys using EQUAL. If you know those
lists are going to be constant (as hash-keys should be), you should
pre-compute a unique hash-key as you propose, IMHO. Bignums should be no
worse and potentially better than arrays or lists of fixnums where the
values from sequential numbers in your key are packed together.

For performance reasons, I would start by adding the values together
starting least significant "digit". This avoids bignum-bignum and
bignum-fixnum adds as much as possible. As a further optimization, DPB
may also be faster than addition, as the compiler won't be as tempted to
construct an intermediate addend and can write directly to the
appropriate location in the bignum's data vector, possibly wasting a few
bits because your "base" will be a power of 2 instead of the exact upper
bound.

Be careful, however, that you use the same "base" for all the values in
a single hash-table or else you might get accidental collisions. E.g.,
(1 1 1) encoded as base 2 is 7 and (1 2) encoded as base 3 is also 7!

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Jim Newton
Subject: Re: efficiently generating hash keys
Date: 
Message-ID: <2vnm65F2lo385U1@uni-berlin.de>
> Be careful, however, that you use the same "base" for all the values in
> a single hash-table or else you might get accidental collisions. E.g.,
> (1 1 1) encoded as base 2 is 7 and (1 2) encoded as base 3 is also 7!
> 

I do not have any hash tables that need to key on lists of different
lengths and the same maximums apply to all the lists.

So, i don't believe the bases do not have to be the same. But i have
to be careful with the mapping.

A shorter example, if the lists are always 4 elements long
and the maximums are ( 8 3 5 7), then i can generate the key by
choosing the base always = 8+1. But i can also choose a more
dense packing by choosing the bases as a function of the maximums.

( 7 2 5 6 ) -->   7 * (7+1)*(5+1)*(3+1)
                 + 2 * (7+1)*(5+1)
                 + 5 * (7+1)
                 + 6

This type of mapping makes it more likely that the result will be a
fixnum and not a bignum.  However, the calculation is more complicated.
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: efficiently generating hash keys
Date: 
Message-ID: <87u0rtl6t9.fsf@qrnik.zagroda>
Jim Newton <·····@rdrop.com> writes:

> The question is how do i decide whether to make the key
> generation expensive and the hash lookup easy, or the
> key generation easy and the hash lookup difficult?

Try both and measure which is faster.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Wade Humeniuk
Subject: Re: efficiently generating hash keys
Date: 
Message-ID: <lnDld.98084$VA5.71172@clgrps13>
Jim Newton wrote:

> I am maintaining several hash tables for which the keys are
> lists of small integers.  I fear that the hash tables might
> grow to a size of about 10000, but i'm not quite sure yet.
> 
> A typical key looks like this.
>  (5 0 4 0 2 2 1 1 6 4 1 2 4 1 5 10 6 10 7 9 1 2 1 9 1 2 5 8)
> 
> I know for each of the positions what the maximum value is.
> For some positions the maximum value is 11, for other positions
> the only possible values are 1 or 2.
> 
> My question is should i use the list itself as a hash key or
> should I generate an integer by sort of a power series
> 

I would think neither, make your keys into '(simple-array (unsigned-byte 4) 28).

(make-array 28 :element-type '(integer 0 11) :adjustable nil)

I would think that EQUALP can perform more efficiently on a specifically
typed array than a list.  It can easily check length, iterate by some memory
pointer deep down and the bytes are stored contiguously in memory (and thus
in cache).  Bignums are just like arrays so it even though you
can use eql hash it has to do what equalp does for simple arrays.
Throw in some declarations where the keys are used. Who knows? The compiler
might perform some incredible optimizations.

However... to be sure you have to test them.

Wade