From: Adam Warner
Subject: sxhash redux
Date: 
Message-ID: <pan.2005.05.31.02.17.43.264581@consulting.net.nz>
Hi all,

[Is the "SX" in "SXHASH" an abbreviation?]

Now that I have successfully implemented custom hash tables, I have a
better appreciation of why SXHASH should always produce hash codes that
are well distributed within the range of each platform's non-negative
fixnums.

If CLISP limits its 64-bit version of SXHASH to the range of its 32-bit
non-negative fixnums, then the hash code is worthless as an index into
any vector with more than 16777215 elements.

This is a serious limitation upon the performance of any hash table with
many millions of elements and a practically unacceptable limitation upon
64-bit platforms.
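
A rough way to probe how many bits SXHASH actually yields on a given
build (a non-authoritative sanity check; the exact values are
implementation-dependent):

(loop for i below 1000
      maximize (integer-length (sxhash (format nil "key~D" i))))
;; compare against the width of the platform's fixnums:
(integer-length most-positive-fixnum)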

Say you have (expt 2 31) elements in a hash table under this hash code
limitation, with the hash codes still beautifully uniform over the
restricted range. A primary vector of size 16777215 would then hold
approximately 128 elements per bucket, so you could search through up to
128 elements before finding the one that matches the equality predicate.
If the predicate is expensive it becomes quite apparent that key lookup
is no longer O(1).
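
Spelling out the arithmetic:

;; 2^31 entries spread over roughly 2^24 buckets (a vector of size 16777215):
(/ (expt 2 31) (expt 2 24))   ; => 128 entries per bucket on average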

It's usually impractical to replace CL:SXHASH with a custom version
because of the speed penalty, which is particularly acute with CLISP. It
is, however, practical for any end user to compile separate FASL files
for the 32-bit and 64-bit versions of CLISP whenever hash code literals
are explicitly or implicitly saved in a FASL file (or, in general,
whenever a fixnum declaration or array size on one platform conflicts
with the other platform).

Thus I now recommend implementing the full range of non-negative fixnum
hash codes on all platforms, regardless of whether they are intended to
share the same FASL files. There are many reasons why implementations
with different fixnum ranges cannot always share the same FASL files,
even though it works a lot of the time. Users who have to deal with the
incompatibilities can assert the compile-time value of
most-positive-fixnum in every FASL file to ensure that the wrong FASL
file is never accidentally loaded:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (assert (= #.most-positive-fixnum (load-time-value most-positive-fixnum))))
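
(The #. literal is fixed when the file is compiled, while the
LOAD-TIME-VALUE form is evaluated by the Lisp that loads the FASL, so
the assertion signals an error as soon as a FASL compiled under a
different fixnum range is loaded.)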

Regards,
Adam

From: Christopher C. Stacy
Subject: Re: sxhash redux
Date: 
Message-ID: <u64x0110v.fsf@news.dtpq.com>
Adam Warner <······@consulting.net.nz> writes:
> Is the "SX" in "SXHASH" an abbreviation?

S-eXpression Hash
From: Bruno Haible
Subject: Re: sxhash redux
Date: 
Message-ID: <d7hj7b$mt5$1@laposte.ilog.fr>
Adam Warner wrote:
>
> If CLISP limits its 64-bit version of SXHASH to the range of its 32-bit
> non-negative fixnums, then the hash code is worthless as an index into
> any vector with more than 16777215 elements.
>
> This is a serious limitation upon the performance of any hash table with
> many millions of elements and a practically unacceptable limitation upon
> 64-bit platforms.

The clisp developers agree. This was fixed in the clisp CVS a week ago
for normal hash tables, and six months ago for package symbol-tables.

              Bruno
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: sxhash redux
Date: 
Message-ID: <87fyw3wer5.fsf@qrnik.zagroda>
Bruno Haible <·····@clisp.org> writes:

>> If CLISP limits its 64-bit version of SXHASH to the range of its 32-bit
>> non-negative fixnums, then the hash code is worthless as an index into
>> any vector with more than 16777215 elements.
>>
>> This is a serious limitation upon the performance of any hash table with
>> many millions of elements and a practically unacceptable limitation upon
>> 64-bit platforms.
>
> The clisp developers agree. This was fixed in the clisp CVS a week ago
> for normal hash tables, and six months ago for package symbol-tables.

Do you think that letting a hash go up to 0x3FFFFFFF (about 1e9) on
64-bit platforms is enough, or is it better to have a bigger range in
this case? In the implementation I have in mind, object sizes are 64
bits on 64-bit platforms, but I'm not sure whether anyone will want to
make a hash table with 1e9 entries (of course it will still work, but
collisions will be frequent).

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Rob Warnock
Subject: Re: sxhash redux
Date: 
Message-ID: <Q5ydnX7R__wrDz7fRVn-1A@speakeasy.net>
Marcin 'Qrczak' Kowalczyk  <······@knm.org.pl> wrote:
+---------------
| Bruno Haible <·····@clisp.org> writes:
| > The clisp developers agree. This was fixed in the clisp CVS a week ago
| > for normal hash tables, and six months ago for package symbol-tables.
| 
| Do you think that letting a hash go up to 0x3FFFFFFF (about 1e9) on
| 64-bit platforms is enough, or is it better to have a bigger range in
| this case? In the implementation I have in mind, object sizes are 64
| bits on 64-bit platforms, but I'm not sure whether anyone will want to
| make a hash table with 1e9 entries (of course it will still work, but
| collisions will be frequent).
+---------------

A well-designed hash has (roughly) equal information in each bit, so
for smaller hash tables only a subset of the bits need be used to index
into the table. The remaining bits can serve as a secondary hash for
resolving collisions. And if one saves the *whole* original SXHASH value
with each object in the hash table, then it is not necessary to rehash
in order to grow the table -- just start using a different subset of the
hash bits.

Which is a long-winded way of saying that on a 64-bit machine it'd
be really nice if SXHASH values were at least 60 (say) bits...
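
For concreteness, a minimal sketch of that scheme in Common Lisp
(hypothetical names, chained buckets, and power-of-two table sizes
assumed -- not CLISP's actual implementation):

(defstruct entry key value hash)        ; each entry keeps its full hash

(defun bucket-index (hash table-size)
  ;; table-size is a power of two, so this keeps the low bits of the hash
  (logand hash (1- table-size)))

(defun lookup (key hash buckets test)
  (dolist (e (aref buckets (bucket-index hash (length buckets))))
    ;; the cheap full-hash comparison filters most collisions before the
    ;; (possibly expensive) equality predicate runs
    (when (and (= hash (entry-hash e))
               (funcall test key (entry-key e)))
      (return (entry-value e)))))

(defun grow (buckets)
  ;; doubling the table re-slots entries by their *stored* hashes,
  ;; so keys never need to be rehashed
  (let ((new (make-array (* 2 (length buckets)) :initial-element '())))
    (loop for bucket across buckets
          do (dolist (e bucket)
               (push e (aref new (bucket-index (entry-hash e) (length new))))))
    new))

Each doubling just brings one more bit of the stored hash into play,
which is why a wide SXHASH pays off.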


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607