Hashing arrays

From: Waldek Hebisch
Subject: Hashing arrays
Date: Tue, 21 Apr 2009 19:01:11 +0000
Message-ID: <gsl55n$c41$1@z-news.wcss.wroc.pl>

I have noticed that in all Lisp implementations that I could
try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
of sxhash applied to an array apparently does not depend on
content of array.  In gcl and ecl I got different values
for differet arrays, in other implementations I get the
same value for arrays of the same size.

Below is a simple test I tried:

[1]> (setf ar (make-array 10 :initial-element nil))
#(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
[2]> (sxhash ar)
1322060457
[3]> (setf ar2 (make-array 10 :initial-element nil))
#(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
[4]> (sxhash ar2)
1322060457
[5]> (setf (aref ar 0) 667)
667
[6]> (sxhash ar)
1322060457
[7]> (setf (aref ar 1) 17)
17
[8]> (sxhash ar)
1322060457
[9]> 


This behaviour makes sxhash useless for hashing arrays, so IMHO
is a bug.  The question is if this behaviour is mandated by
the standard, or just several implementations independently
have the same bug?

BTW: In the past I have noticed very poor performance using
EQUALP hash table with keys beeing arrays.  If hash table
used sxhash as hash function, this can explain slowness...

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl

Re: Hashing arrays Tim Bradshaw
- Re: Hashing arrays Raffael Cavallaro
  - Re: Hashing arrays Madhu
  - Re: Hashing arrays Tim Bradshaw
Re: Hashing arrays Tamas K Papp
- Re: Hashing arrays Marcus Breiing
Re: Hashing arrays Stanisław Halik
Re: Hashing arrays Martin Rubey
- Re: Hashing arrays Raffael Cavallaro
- Re: Hashing arrays Thomas A. Russ
  - Re: Hashing arrays Tamas K Papp
- Re: Hashing arrays Kaz Kylheku
  - Re: Hashing arrays Thomas A. Russ
Re: Hashing arrays Madhu
- Re: Hashing arrays ··········@gmail.com
  - Re: Hashing arrays Thomas F. Burdick
  - Re: Hashing arrays Madhu
    - Re: Hashing arrays gugamilare
      - Re: Hashing arrays Madhu
        Re: Hashing arrays gugamilare
        Re: Hashing arrays Thomas A. Russ
        Re: Hashing arrays Madhu
  - Re: Hashing arrays Thomas A. Russ
    - Re: Hashing arrays ·····@sherbrookeconsulting.com
    - Re: Hashing arrays gugamilare
      - Re: Hashing arrays Thomas A. Russ
  - Re: Hashing arrays Waldek Hebisch
    - Re: Hashing arrays Tamas K Papp
      - Re: Hashing arrays Thomas A. Russ
        Re: Hashing arrays Kaz Kylheku
- Re: Hashing arrays ······@nhplace.com

From: Tim Bradshaw
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 21:19:05 +0000
Message-ID: <c93363f2-cf36-4ac4-9f05-bc092163898a@t11g2000vbc.googlegroups.com>

On Apr 21, 8:01 pm, Waldek Hebisch <·······@math.uni.wroc.pl> wrote:

>
> This behaviour makes sxhash useless for hashing arrays, so IMHO
> is a bug.  The question is if this behaviour is mandated by
> the standard, or just several implementations independently
> have the same bug?

One question to ask is: what should SXHASH base its computation on for
an array?  Basing it on the contents, for instance, means it changes
if the contents change.  That would mean that it could not be used for
EQ, EQL or EQUAL hashtables, though I think it would serve for EQUALP
hashtables, so long as it did not encode identity as well.  You could
base it on identity somehow, which would make it useful for EQ, EQL, &
EQUAL tables but not for EQUALP ones, since they can not encode
identity.

So, if you want to use SXHASH in the implementation of the standard CL
hashtables it can not encode identity *or* content for arrays.  It
also can't encode element type.  This leaves essentially some function
of the dimensions of the array.

From: Raffael Cavallaro
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 22:07:52 +0000
Message-ID: <109f71c1-9b04-4eb5-85ad-29b87d3a8a37@u8g2000yqn.googlegroups.com>

On Apr 21, 5:19 pm, Tim Bradshaw <··········@tfeb.org> wrote:
> On Apr 21, 8:01 pm, Waldek Hebisch <·······@math.uni.wroc.pl> wrote:
>
>
>
> > This behaviour makes sxhash useless for hashing arrays, so IMHO
> > is a bug.  The question is if this behaviour is mandated by
> > the standard, or just several implementations independently
> > have the same bug?
>
> One question to ask is: what should SXHASH base its computation on for
> an array?  Basing it on the contents, for instance, means it changes
> if the contents change.  That would mean that it could not be used for
> EQ, EQL or EQUAL hashtables, though I think it would serve for EQUALP
> hashtables, so long as it did not encode identity as well.  You could
> base it on identity somehow, which would make it useful for EQ, EQL, &
> EQUAL tables but not for EQUALP ones, since they can not encode
> identity.
>
> So, if you want to use SXHASH in the implementation of the standard CL
> hashtables it can not encode identity *or* content for arrays.  It
> also can't encode element type.  This leaves essentially some function
> of the dimensions of the array.

"For objects of types that equal compares with eq, item 3 requires
that the hash-code be based on some immutable quality of the identity
of the object."

So except for bit-vectors and strings, sxhash for arrays must be
identity based, no?

From: Madhu
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 04:14:08 +0000
Message-ID: <m3r5zl9vfz.fsf@moon.robolove.meer.net>

* Raffael Cavallaro
in Article <····································@u8g2000yqn.googlegroups.com>
Wrote on Tue, 21 Apr 2009 15:07:52 -0700 (PDT):

[This article (and many articles in this topic) are missing from
 the news.motzarella.com nntp server.  Perhaps someone knows why?]

| On Apr 21, 5:19�pm, Tim Bradshaw

|> One question to ask is: what should SXHASH base its computation on for
|> an array? �Basing it on the contents, for instance, means it changes
|> if the contents change. �That would mean that it could not be used for
|> EQ, EQL or EQUAL hashtables, though I think it would serve for EQUALP
|> hashtables, so long as it did not encode identity as well. �You could
|> base it on identity somehow, which would make it useful for EQ, EQL, &
|> EQUAL tables but not for EQUALP ones, since they can not encode
|> identity.
|
|> So, if you want to use SXHASH in the implementation of the standard CL
|> hashtables it can not encode identity *or* content for arrays. �It
|> also can't encode element type. �This leaves essentially some function
|> of the dimensions of the array.
|
| "For objects of types that equal compares with eq, item 3 requires
| that the hash-code be based on some immutable quality of the identity
| of the object."

| So except for bit-vectors and strings, sxhash for arrays must be
| identity based, no?

I think This `note' is of dubious value in the specification.  It
concerns itself with showing that that SXHASH _can_ be identity based
and it can be thus because it does not restrict SXHASH from producing
the same code for an object in a lisp session, and the `externalized
version' (say in the fasl file) of the same object.

But Tim Bradshaw's arguments you cited above explain why SXHASH should
_not_ be based on identity and overrule this suggestion in most
implementations

--
Madhu

From: Tim Bradshaw
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 16:56:53 +0000
Message-ID: <acb2482a-d9a6-49e8-810b-9dc1d569edca@k38g2000yqh.googlegroups.com>

On Apr 21, 11:07 pm, Raffael Cavallaro the object."
>
> So except for bit-vectors and strings, sxhash for arrays must be
> identity based, no?

No.  For any two arrays which are EQ, then SXHASH must generate the
same number.  It may *also* generate the same number for arrays which
are *not* EQ, but this is allowed.

From: Tamas K Papp
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 19:10:20 +0000
Message-ID: <756k0rF170e4fU1@mid.individual.net>

On Tue, 21 Apr 2009 19:01:11 +0000, Waldek Hebisch wrote:

> I have noticed that in all Lisp implementations that I could try (clisp,
> Closure CL, ecl, gcl, Poplog Clisp, sbcl) value of sxhash applied to an
> array apparently does not depend on content of array.  In gcl and ecl I
> got different values for differet arrays, in other implementations I get
> the same value for arrays of the same size.
> 
> Below is a simple test I tried:
> 
> [1]> (setf ar (make-array 10 :initial-element nil)) #(NIL NIL NIL NIL
> NIL NIL NIL NIL NIL NIL) [2]> (sxhash ar)
> 1322060457
> [3]> (setf ar2 (make-array 10 :initial-element nil)) #(NIL NIL NIL NIL
> NIL NIL NIL NIL NIL NIL) [4]> (sxhash ar2)
> 1322060457
> [5]> (setf (aref ar 0) 667)
> 667
> [6]> (sxhash ar)
> 1322060457
> [7]> (setf (aref ar 1) 17)
> 17
> [8]> (sxhash ar)
> 1322060457
> [9]>
> 
> 
> This behaviour makes sxhash useless for hashing arrays, so IMHO is a
> bug.  The question is if this behaviour is mandated by the standard, or
> just several implementations independently have the same bug?
> 
> BTW: In the past I have noticed very poor performance using EQUALP hash
> table with keys beeing arrays.  If hash table used sxhash as hash
> function, this can explain slowness...

I notice something similar in SBCL.  The Hyperspec page for sxhash
says:

4. The hash-code is intended for hashing. This places no verifiable
constraint on a conforming implementation, but the intent is that an
implementation should make a good-faith effort to produce hash-codes
that are well distributed within the range of non-negative fixnums.

So, it this is not strictly a bug, it could qualify as a nuisance.
Try contacting the authors of your favorite implementation about it,
and let us know what they said.

Tamas

From: Marcus Breiing
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 21:07:32 +0000
Message-ID: <huag1oq1z31i5@breiing.com>

* Tamas K Papp

> I notice something similar in SBCL.  The Hyperspec page for sxhash
> says:
> 4. ...

The behavior follows from "3. The hash-code for an object is always
the same within a single session provided that the object is not
visibly modified with regard to the equivalence test EQUAL".

Because EQUAL is EQ for arrays, SXHASH can't depend on array contents.
A lucky implementation may use an array's memory address if that never
changes. I guess that's what gcl and ecl do. Others have copying GCs
and must fall back on, say, LENGTH for non-adjustable arrays.


-- 
Marcus

From: Stanisław Halik
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 21:04:36 +0000
Message-ID: <gslcd4$18qh$1@opal.icpnet.pl>

thus spoke Waldek Hebisch <·······@math.uni.wroc.pl>:

> I have noticed that in all Lisp implementations that I could
> try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
> of sxhash applied to an array apparently does not depend on
> content of array.

SXHASH provides EQUAL hashing semantics. Only strings are compared by
their contents.

From: Martin Rubey
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 22:21:25 +0000
Message-ID: <d98wltr6l6.fsf@ada0.ifam.uni-hannover.de>

Waldek Hebisch <·······@math.uni.wroc.pl> writes:

> Below is a simple test I tried:
>
> [1]> (setf ar (make-array 10 :initial-element nil))
> #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> [2]> (sxhash ar)
> 1322060457
> [3]> (setf ar2 (make-array 10 :initial-element nil))
> #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> [4]> (sxhash ar2)
> 1322060457

Raffael Cavallaro <················@gmail.com> writes:

> IOW, common lisp arrays do not change their identity when you modify
> their contents - they are the same lisp object (i.e., under #'eq)
> before and after mutation. Lisp arrays are mutable data structures
> that maintain identity under mutation, much as a person maintains his
> identity even if he loses his hair. It is therefore to be expected
> that sxhash should return the same value for a particular array before
> and after its contents are modified.

But in the test above, Waldek computed the hash of two different arrays,
didn't he?

Martin

From: Raffael Cavallaro
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 22:26:35 +0000
Message-ID: <b6b0406c-3aac-4fcd-8d36-c66d79f60130@q16g2000yqg.googlegroups.com>

On Apr 21, 6:21 pm, Martin Rubey <········@yahoo.de> wrote:
> Waldek Hebisch <·······@math.uni.wroc.pl> writes:
> > Below is a simple test I tried:
>
> > [1]> (setf ar (make-array 10 :initial-element nil))
> > #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> > [2]> (sxhash ar)
> > 1322060457
> > [3]> (setf ar2 (make-array 10 :initial-element nil))
> > #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> > [4]> (sxhash ar2)
> > 1322060457
> Raffael Cavallaro <················@gmail.com> writes:
> > IOW, common lisp arrays do not change their identity when you modify
> > their contents - they are the same lisp object (i.e., under #'eq)
> > before and after mutation. Lisp arrays are mutable data structures
> > that maintain identity under mutation, much as a person maintains his
> > identity even if he loses his hair. It is therefore to be expected
> > that sxhash should return the same value for a particular array before
> > and after its contents are modified.
>
> But in the test above, Waldek computed the hash of two different arrays,
> didn't he?
>
> Martin

Sorry for the confusion - my post which you quote was a confused
muddle which is why I deleted it.

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Tue, 21 Apr 2009 23:12:28 +0000
Message-ID: <ymimya9mwir.fsf@blackcat.isi.edu>

Martin Rubey <········@yahoo.de> writes:

> Waldek Hebisch <·······@math.uni.wroc.pl> writes:
> 
> > Below is a simple test I tried:
> >
> > [1]> (setf ar (make-array 10 :initial-element nil))
> > #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> > [2]> (sxhash ar)
> > 1322060457
> > [3]> (setf ar2 (make-array 10 :initial-element nil))
> > #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
> > [4]> (sxhash ar2)
> > 1322060457

> But in the test above, Waldek computed the hash of two different arrays,
> didn't he?

Yes, they are two different arrays.

But the question is how one would go about implementing an appropriate
SXHASH function that works for all of the equality test types that it
needs to work for.

For EQ, EQL or EQUAL hash tables, one could use identity as part of the
hash computation algorithm.  Now, for lisp implementations with copying
garbage collectors, this would imply having some means of identifying
an array that doesn't depend on its memory location.  One possible
method would be to compute a hash code at the time the array is created
and just store it.  That would, naturally, increase the memory footprint
of all arrays, and so has its own costs.

But it gets worse when you consider EQUALP hash tables, which need to be
able to hash to arrays which are considered EQUALP to the same hash
code.  So that seems (as others have pointed out) to rule out any hash
algorithm that depends on the identity of the array.

And computing a hash code based on the mutable contents also has its
problems.  That is because doing anything to the array that would change
its hash code is likely to make any hash table entries for that array
unfindable, since the hash code used to look it up has now changed.

So, it isn't too surprising that there are a limited set of features of
an array that are available for use in a hashing algorithm.  They are
limited to chose parts of the array that are immutable, which really is
things like the dimensions of the array.  I'm not sure whether upgraded
element type can be considered.  I would have to look at the details of
the EQUALP test to see.

I don't even want to think about what the situation is like for
adjustable arrays....

So, the bottom line is that such an SXHASH function is not buggy, but is
highly constrained by the requirements of the standard.

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Tamas K Papp
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 00:56:28 +0000
Message-ID: <75789sF16ltbcU1@mid.individual.net>

On Tue, 21 Apr 2009 16:12:28 -0700, Thomas A. Russ wrote:

> I don't even want to think about what the situation is like for
> adjustable arrays....
> 
> So, the bottom line is that such an SXHASH function is not buggy, but is
> highly constrained by the requirements of the standard.

Thank you for explaining this, now I see why implementing this for
arrays is not that easy, and the designers of CL made a wise choice
when they set up SXHASH as they did.

Tamas

From: Kaz Kylheku
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 00:49:01 +0000
Message-ID: <20090501221018.705@gmail.com>

On 2009-04-21, Martin Rubey <········@yahoo.de> wrote:
> Waldek Hebisch <·······@math.uni.wroc.pl> writes:
>
>> Below is a simple test I tried:
>>
>> [1]> (setf ar (make-array 10 :initial-element nil))
>> #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
>> [2]> (sxhash ar)
>> 1322060457
>> [3]> (setf ar2 (make-array 10 :initial-element nil))
>> #(NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL)
>> [4]> (sxhash ar2)
>> 1322060457
>
> Raffael Cavallaro <················@gmail.com> writes:
>
>> IOW, common lisp arrays do not change their identity when you modify
>> their contents - they are the same lisp object (i.e., under #'eq)
>> before and after mutation. Lisp arrays are mutable data structures
>> that maintain identity under mutation, much as a person maintains his
>> identity even if he loses his hair. It is therefore to be expected
>> that sxhash should return the same value for a particular array before
>> and after its contents are modified.
>
> But in the test above, Waldek computed the hash of two different arrays,
> didn't he?

Yes, but that still doesn't prove anything; hashing functions can exhibit
legitimate collisions which are not degenerate behavior.

Suppose that the EQ hashing function takes bits from the address, but only to
the granularity of a page. Then all objects whose base address is in the same
page get hashed to the same value. That would be an acceptable function for
purposes of hashing. The number of objects that will collide due to this is
strictly bounded since only so many objects can fit into a page. 

If you allocate two small arrays one after another, they are likely to be
collocated on the same page.

You either have to look at the code of sxhash, or construct a better empirical
test: show that a large number of such arrays all hash to the same value, not
only two.

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 01:02:52 +0000
Message-ID: <ymiiqkxmrer.fsf@blackcat.isi.edu>

Kaz Kylheku <········@gmail.com> writes:

> You either have to look at the code of sxhash, or construct a better empirical
> test: show that a large number of such arrays all hash to the same value, not
> only two.

Looking at the code would be the best solution.

But one could imagine that one possible implementation would be to
compute a hash value for a non-adjustable array by simply summing the
size of the array dimensions, each multiplied by some moderate-sized
prime number and then normalized to be in the fixnum range.

With such a hash function, all arrays of the same dimension would
collide, but it would at least fulfill the needed constraints.

OK, what happens when one array is displaced to another?  Do they need
to have the same SXHASH?

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Madhu
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 02:13:23 +0000
Message-ID: <m3ws9dpha4.fsf@moon.robolove.meer.net>

* Waldek Hebisch <············@z-news.wcss.wroc.pl> :
Wrote on Tue, 21 Apr 2009 19:01:11 +0000 (UTC):

| I have noticed that in all Lisp implementations that I could
| try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
| of sxhash applied to an array apparently does not depend on
| content of array.  In gcl and ecl I got different values
| for differet arrays, in other implementations I get the
| same value for arrays of the same size.
|
| Below is a simple test I tried:

[]

| This behaviour makes sxhash useless for hashing arrays, so IMHO
| is a bug.  The question is if this behaviour is mandated by
| the standard, or just several implementations independently
| have the same bug?

The bug may just be in your expectation of what SXHASH can do. In CL
arrays are mutable.  The mutability of arrays has consequences for EQUAL
hash tables --- The same array with a different contents should produce
a different key.  Now EQUAL hash tables should consider contents anyway
so SXHASH can at best compute a bucket where all elements have to be
compared individually.

But still, all is not well.

There is, I think an important deficiency in the the GETHASH
specification.  It does not mention anything about the copying (or
non-copying) of keys.  While this is not an issue in dumbed down
languages where all objects are made immutable to the programmer, again,
in CL arrays are mutable, and if you are passing an adjustable array
(say) to puthash (i.e. (SETF GETHASH)) without making a copy, you may be
in for "surprises".

[I thought I'd posted about this asking for KMP's clarification back in
 2004, but I realize now it never made it to usenet.  Also my private
 mail to KMP on some clarifications seems to have been ignored]

--
Madhu

From: ··········@gmail.com
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 04:11:15 +0000
Message-ID: <0dd7ae3b-7a8f-48d7-87fc-b360611ffe67@u8g2000yqn.googlegroups.com>

On 21 abr, 23:13, Madhu <·······@meer.net> wrote:
> * Waldek Hebisch <············@z-news.wcss.wroc.pl> :
> Wrote on Tue, 21 Apr 2009 19:01:11 +0000 (UTC):
>
> | I have noticed that in all Lisp implementations that I could
> | try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
> | of sxhash applied to an array apparently does not depend on
> | content of array.  In gcl and ecl I got different values
> | for differet arrays, in other implementations I get the
> | same value for arrays of the same size.
> |
> | Below is a simple test I tried:
>
> []
>
> | This behaviour makes sxhash useless for hashing arrays, so IMHO
> | is a bug.  The question is if this behaviour is mandated by
> | the standard, or just several implementations independently
> | have the same bug?
>
> The bug may just be in your expectation of what SXHASH can do. In CL
> arrays are mutable.  The mutability of arrays has consequences for EQUAL
> hash tables --- The same array with a different contents should produce
> a different key.  Now EQUAL hash tables should consider contents anyway
> so SXHASH can at best compute a bucket where all elements have to be
> compared individually.

I think you (and a few other people) are mistaking the behaviour of
the equal test for arrays.

[1]> (equal #(1 2 3) #(1 2 3))
NIL

Two arrays are equal iff they are eq (except for strings and bit
vectors). And sxhash is meant to work with equal. This means that
hashing the same array with different contents (e.g. hashing before
and after you modify it) should produce the _same_ result, as the spec
for the function sxhash says:

3. The hash-code for an object is always the same within a single
session provided that the object is not visibly modified with regard
to the equivalence test equal. See Section 18.1.2 (Modifying Hash
Table Keys).

And down from there:

For objects of types that equal compares with eq, item 3 requires that
the hash-code be based on some immutable quality of the identity of
the object.

The same thing can be concluded at the section 18.1.2 of hyperspec,
namely "Modifying Hash Table Keys", which states that an array can
only be modified with regard to the test equalp.
http://www.lispworks.com/documentation/HyperSpec/Body/18_ab.htm

OTOH, hashing different but equivalent arrays may lead to either equal
or different results. This means that the contents of the array
_cannot_ be used to compute the hash.

It would be better if sxhash accepted an optional test function (among
eq, eql, equal and equalp) and computed the hash according to that
test. So, to compute a hash that uses the array contents, one could
call

(sxhash ar #'equalp)

and someone who cares about identity could call

(sxhash ar #'eq)

This way, everyone would be happy :)

From: Thomas F. Burdick
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 06:54:37 +0000
Message-ID: <2a2dc75f-de0c-4f1c-9fde-e78896ca4742@w40g2000yqd.googlegroups.com>

On Apr 22, 6:11 am, ··········@gmail.com wrote:
> On 21 abr, 23:13, Madhu <·······@meer.net> wrote:
>
> I think you (and a few other people) are mistaking the behaviour of
> the equal test for arrays.

(And it's fairly easy to see why, given EQUAL's behavior for that very
common sort of array, the string)

Actually, I think this thread is based on a more basic mistake :
thinking that the CL hash-table interface is fully fleshed out and
generically useful. The spec for SXHASH isn't quite constrained enough
to guarantee that you can portably use it to, for example, implement
EQ tries, which would be an actual use case. You need to depend either
on implementation details or on an implementation-specific function if
you want your tries to work as intended. Standard hash-tables work
fine for the large number of times where the case you need is covered
by EQ, EQL, EQUAL or EQUALP. If your needs pass beyond this, you're
back to non-portability.

I'm sure this was a reasonable compromise at the time, and it's better
than having a bad interface to extensible hash-tables be mandated by
the standard. You can see a hint of what that could have looked like
with SXHASH, which looks useful on first blush.

From: Madhu
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 04:37:52 +0000
Message-ID: <m3mya99ucf.fsf@moon.robolove.meer.net>

* ··········@gmail.com in article
<····································@u8g2000yqn.googlegroups.com> :
Wrote on Tue, 21 Apr 2009 21:11:15 -0700 (PDT):

|> The bug may just be in your expectation of what SXHASH can do. In CL
|> arrays are mutable. �The mutability of arrays has consequences for EQUAL
|> hash tables --- The same array with a different contents should produce
|> a different key. �Now EQUAL hash tables should consider contents anyway
|> so SXHASH can at best compute a bucket where all elements have to be
|> compared individually.
|
| I think you (and a few other people) are mistaking the behaviour of
| the equal test for arrays.

Sorry for the confusion, That was a typo on my part. I should have
writtern "EQUALP hash tables" above, not "EQUAL hash tables"

But see Article Tim Bradshaw's post in this thread 

  <·········································@t11g2000vbc.googlegroups.com>

about why an SXHASH that uses identity for arrays would be useless for
an implementation to use for implementing hash tables
Similarly an SXHASH that uses contents.


| [1]> (equal #(1 2 3) #(1 2 3))
| NIL
|
| Two arrays are equal iff they are eq (except for strings and bit
| vectors). And sxhash is meant to work with equal. This means that
| hashing the same array with different contents (e.g. hashing before
| and after you modify it) should produce the _same_ result, as the spec
| for the function sxhash says:

Indeed.

| 3. The hash-code for an object is always the same within a single
| session provided that the object is not visibly modified with regard
| to the equivalence test equal. See Section 18.1.2 (Modifying Hash
| Table Keys).

| And down from there:
|
| For objects of types that equal compares with eq, item 3 requires that
| the hash-code be based on some immutable quality of the identity of
| the object.

See my parallel response in <···················@moon.robolove.meer.net>
on this "Item 3" and the note below.  In the case of arrays, I believe
TFB'S considerations outweigh the suggestion in this `Note'.

| The same thing can be concluded at the section 18.1.2 of hyperspec,
| namely "Modifying Hash Table Keys", which states that an array can
| only be modified with regard to the test equalp.
| http://www.lispworks.com/documentation/HyperSpec/Body/18_ab.htm

I'm not sure what you mean by that statement.  If an array used as a key
is "visibly modified", (as defined in this section', future behaviour of
the hash table is "unspecified".  Only negative conclusions are possible
from this section.  

| OTOH, hashing different but equivalent arrays may lead to either equal
| or different results. This means that the contents of the array
| _cannot_ be used to compute the hash.
|
| It would be better if sxhash accepted an optional test function (among
| eq, eql, equal and equalp) and computed the hash according to that
| test. So, to compute a hash that uses the array contents, one could
| call
|
| (sxhash ar #'equalp)
|
| and someone who cares about identity could call
|
| (sxhash ar #'eq)
|
| This way, everyone would be happy :)

And what about user defined tests?

I think its better to use SXHASH for the intended purpose, and use other
"outside the CL standard" implementation specific mechanisms if
extensions are required.

--
Madhu

From: gugamilare
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 04:56:42 +0000
Message-ID: <d737214d-dae5-4487-92de-151366ecd75a@q16g2000yqg.googlegroups.com>

On 22 abr, 01:37, Madhu <·······@meer.net> wrote:
> * ··········@gmail.com in article
> <····································@u8g2000yqn.googlegroups.com> :
> Wrote on Tue, 21 Apr 2009 21:11:15 -0700 (PDT):
>
(Just to clarify, it was me talking here, it was just a login problem)

> And what about user defined tests?

User defined tests already can't be used in hashtables, so I believe
this is a related problem. The answer here would be both sxhash and
make-hash-table accept other hash and test functions. Maybe defining a
map 1-1 on test and hash functions, in such a way that sxhash and make-
hashtable would signal errors if some non-defined test function is
provided. But, since hashtables can't accept user defined tests in the
current state, sxhash shouldn't as well.
>
> I think its better to use SXHASH for the intended purpose, and use other
> "outside the CL standard" implementation specific mechanisms if
> extensions are required.

The problem is that one need to implement a hash function for the
equalp test. But all implementations already have it defined, it is a
waste of time and a duplication of efforts to implement it.
>
> --
> Madhu

From: Madhu
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 05:09:52 +0000
Message-ID: <m3r5zlnujj.fsf@moon.robolove.meer.net>

* gugamilare <····································@q16g2000yqg.googlegroups.com> :
Wrote on Tue, 21 Apr 2009 21:56:42 -0700 (PDT):

|> I think its better to use SXHASH for the intended purpose, and use
|> other "outside the CL standard" implementation specific mechanisms if
|> extensions are required.
|
| The problem is that one need to implement a hash function for the
| equalp test. But all implementations already have it defined, it is a
| waste of time and a duplication of efforts to implement it.

(reduce #'+ array :key #'sxhash)

(reduce #'logior array :key #'sxhash)


have been suggested before on cll (i think)

--
Madhu

From: gugamilare
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 05:28:07 +0000
Message-ID: <8f44102b-9b33-4c01-a421-b521745f9ee7@t21g2000yqi.googlegroups.com>

On 22 abr, 02:09, Madhu <·······@meer.net> wrote:
> * gugamilare <····································@q16g2000yqg.googlegroups.com> :
> Wrote on Tue, 21 Apr 2009 21:56:42 -0700 (PDT):
>
> |> I think its better to use SXHASH for the intended purpose, and use
> |> other "outside the CL standard" implementation specific mechanisms if
> |> extensions are required.
> |
> | The problem is that one need to implement a hash function for the
> | equalp test. But all implementations already have it defined, it is a
> | waste of time and a duplication of efforts to implement it.
>
> (reduce #'+ array :key #'sxhash)
>
> (reduce #'logior array :key #'sxhash)
>
> have been suggested before on cll (i think)
>
> --
> Madhu

Ok, this works for arrays. It is still missing hash functions for
other kind of arguments. Besides, I believe this is a not a good
implementation because it doesn't have enough entropy. #(1 2) would
return the same hash as #(2 1) would. It would be better to have a
sequence of primes (or something near that, e.g., the sequence of
odds) and multiply to each sxhash obtained from the elements of the
array. Something like

(loop for elt across array
      for i from 3 by 2
      summing (mod (* i (sxhash elt)) most-positive-fixnum)

This version may as well have defects. This being a standard would be
much better. Not that I am complaining, I never needed to hash
anything.

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 18:37:23 +0000
Message-ID: <ymi1vrkmt5o.fsf@blackcat.isi.edu>

Madhu <·······@meer.net> writes:

> * gugamilare <····································@q16g2000yqg.googlegroups.com> :
> Wrote on Tue, 21 Apr 2009 21:56:42 -0700 (PDT):
> 
> |> I think its better to use SXHASH for the intended purpose, and use
> |> other "outside the CL standard" implementation specific mechanisms if
> |> extensions are required.
> |
> | The problem is that one need to implement a hash function for the
> | equalp test. But all implementations already have it defined, it is a
> | waste of time and a duplication of efforts to implement it.
> 
> (reduce #'+ array :key #'sxhash)
> 
> (reduce #'logior array :key #'sxhash)
> 
> 
> have been suggested before on cll (i think)

But it doesn't work.  Try this:

(defun my-hash-code (array)
  (reduce #'+ array :key #'sxhash))

(defvar *ht* (make-hash-table :test #'equalp))
(defvar *array* (make-array 3 :initial-contents '(1 2 3)))

(setf (gethash (my-hash-code *array*) *ht*) 'some-value)

(gethash (my-hash-code *array*) *ht*)
(gethash (my-hash-code (make-array 3 :initial-contents '(1 2 3))) *ht*)

(setf (aref *array* 1) 10)

(gethash (my-hash-code *array*) *ht*)
(gethash (my-hash-code (make-array 3 :initial-contents '(1 10 3))) *ht*)


But as a general method, this also shows that it's pretty easy to
construct a hash-table that uses your own hash function.


-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Madhu
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 23:13:10 +0000
Message-ID: <m3r5zkmge1.fsf@moon.robolove.meer.net>

* (Thomas A. Russ) <···············@blackcat.isi.edu> :
Wrote on 22 Apr 2009 11:37:23 -0700:

| Madhu <·······@meer.net> writes:
|> | The problem is that one need to implement a hash function for the
|> | equalp test. But all implementations already have it defined, it is a
|> | waste of time and a duplication of efforts to implement it.
|> 
|> (reduce #'+ array :key #'sxhash)
|> 
|> (reduce #'logior array :key #'sxhash)
|> have been suggested before on cll (i think)
|
| But it doesn't work.  Try this:

[example snipped]

But the hash code can only be used to get the bucket in which the
element can lie. [I ought to have added that] you should resolve
collisions manually---since you are implementing your own hash
table---perhaps by storing the key value in an alist and doing an ASSOC
:test TEST on all elements in the bucket after using GETHASH on the
hash-code to get at the alist.

But that said example you posted, does not seem to demonstrate collision
so I suspect it is missing a step, or I am missing the point.

--
Madhu

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 18:27:58 +0000
Message-ID: <ymi63gwmtld.fsf@blackcat.isi.edu>

··········@gmail.com writes:

> It would be better if sxhash accepted an optional test function (among
> eq, eql, equal and equalp) and computed the hash according to that
> test. So, to compute a hash that uses the array contents, one could
> call
> 
> (sxhash ar #'equalp)

I think a hash that uses the array contents would fail, since you would
then be in the position that

  (sxhash ar #'equalp)  =>  M
  (setf (aref ar 10) some-new-value)
  (sxhash ar #'equalp)  =>  N

where M /= N.

Any attempt to find an item stored with the array as a key before
modification would not be guaranteed to be found after modification
because the hash code might have changed, rendering the first value
unfindable.


-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: ·····@sherbrookeconsulting.com
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 19:25:59 +0000
Message-ID: <9adda0ed-86f6-4091-9a57-d09b51745d76@q16g2000yqg.googlegroups.com>

On Apr 22, 2:27 pm, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> ··········@gmail.com writes:
[...]
> > (sxhash ar #'equalp)

> I think a hash that uses the array contents would fail, since you would
> then be in the position that

>   (sxhash ar #'equalp)  =>  M
>   (setf (aref ar 10) some-new-value)
>   (sxhash ar #'equalp)  =>  N

> where M /= N.

AIUI, that is a place where conformant implementations are allowed to
fail, because the consequences of modifying a hash table key in a
"visible" way are undefined.

Cheers,
Pillsy

From: gugamilare
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 19:41:52 +0000
Message-ID: <b5bafdb5-d684-4d5b-b946-5b7b1bd07a87@s20g2000yqh.googlegroups.com>

On 22 abr, 15:27, ····@sevak.isi.edu (Thomas A. Russ) wrote:
> ··········@gmail.com writes:
> > It would be better if sxhash accepted an optional test function (among
> > eq, eql, equal and equalp) and computed the hash according to that
> > test. So, to compute a hash that uses the array contents, one could
> > call
>
> > (sxhash ar #'equalp)
>
> I think a hash that uses the array contents would fail, since you would
> then be in the position that
>
>   (sxhash ar #'equalp)  =>  M
>   (setf (aref ar 10) some-new-value)
>   (sxhash ar #'equalp)  =>  N
>
> where M /= N.

But, if you are using an array as an index, you are not supposed to
modify its contents. This is just unsafe code, it is not supposed to
work anyway, just like an infinite loop.

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 20:34:07 +0000
Message-ID: <ymiocuol96o.fsf@blackcat.isi.edu>

gugamilare <··········@gmail.com> writes:

> But, if you are using an array as an index, you are not supposed to
> modify its contents. This is just unsafe code, it is not supposed to
> work anyway, just like an infinite loop.

Ah, good point.  I had overlooked that restriction.


-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Waldek Hebisch
Subject: Re: Hashing arrays
Date: Fri, 24 Apr 2009 15:11:18 +0000
Message-ID: <gsskqm$93j$1@z-news.wcss.wroc.pl>

··········@gmail.com wrote:
> On 21 abr, 23:13, Madhu <·······@meer.net> wrote:
> > * Waldek Hebisch <············@z-news.wcss.wroc.pl> :
> > Wrote on Tue, 21 Apr 2009 19:01:11 +0000 (UTC):
> >
> > | I have noticed that in all Lisp implementations that I could
> > | try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
> > | of sxhash applied to an array apparently does not depend on
> > | content of array. �In gcl and ecl I got different values
> > | for differet arrays, in other implementations I get the
> > | same value for arrays of the same size.
> > |
> > | Below is a simple test I tried:
> >
> > []
> >
> > | This behaviour makes sxhash useless for hashing arrays, so IMHO
> > | is a bug. �The question is if this behaviour is mandated by
> > | the standard, or just several implementations independently
> > | have the same bug?
> >
> > The bug may just be in your expectation of what SXHASH can do. In CL
> > arrays are mutable. �The mutability of arrays has consequences for EQUAL
> > hash tables --- The same array with a different contents should produce
> > a different key. �Now EQUAL hash tables should consider contents anyway
> > so SXHASH can at best compute a bucket where all elements have to be
> > compared individually.
> 
> I think you (and a few other people) are mistaking the behaviour of
> the equal test for arrays.
> 
> [1]> (equal #(1 2 3) #(1 2 3))
> NIL
> 
> Two arrays are equal iff they are eq (except for strings and bit
> vectors). And sxhash is meant to work with equal. This means that
> hashing the same array with different contents (e.g. hashing before
> and after you modify it) should produce the _same_ result, as the spec
> for the function sxhash says:
> 
> 3. The hash-code for an object is always the same within a single
> session provided that the object is not visibly modified with regard
> to the equivalence test equal. See Section 18.1.2 (Modifying Hash
> Table Keys).
> 
> And down from there:
> 
> For objects of types that equal compares with eq, item 3 requires that
> the hash-code be based on some immutable quality of the identity of
> the object.
> 

I missed the interaction of point 3 above with equal test -- both
_together_ imply that array content can not be used for hashing.
Without point 3 implementation could still use content, because
array which are eq in typical implementation have equal content
(I wrote "in typical implementation" because I was unable to find
definitive statement in the spec).

Still, this is IHMO serious deficiency in the spec.  Namely,
hash tables are used to gain performance.  I can implement
my own hash function, but it is unlikely to be as fast as hash
function tuned by implementers for specific implementation.

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl

From: Tamas K Papp
Subject: Re: Hashing arrays
Date: Fri, 24 Apr 2009 16:34:51 +0000
Message-ID: <75e81bF17me3rU1@mid.individual.net>

On Fri, 24 Apr 2009 15:11:18 +0000, Waldek Hebisch wrote:

> Still, this is IHMO serious deficiency in the spec.  Namely, hash tables
> are used to gain performance.  I can implement my own hash function, but
> it is unlikely to be as fast as hash function tuned by implementers for
> specific implementation.

That was my initial thought, but then I realized that 

1) since Lisp arrays are incredibly baroque (think fill pointers,
displaced arrays, etc), it would be extremely difficult if not
impossible to handle that,

2) I am not entirely sure that using arrays larger than a few elements
as hash keys is a good idea (I don't know the context of the problem
though), and if we are talking about a few elements, you can always
use lists,

3) As a last resort, you can implement your own hash tables, and you
can make them as fast as your implementation's.

Tamas

From: Thomas A. Russ
Subject: Re: Hashing arrays
Date: Fri, 24 Apr 2009 19:13:40 +0000
Message-ID: <ymik559eufv.fsf@blackcat.isi.edu>

Tamas K Papp <······@gmail.com> writes:

> 3) As a last resort, you can implement your own hash tables, and you
> can make them as fast as your implementation's.

This may be a pretty good solution, and perhaps not necessarily a last
resort.  Since you have more specialized knowledge of what you are using
as a key, you could actually make things MORE efficient the the
implementation's general-purpose hash tables.  You don't have to
necessarily meet all of the strictures of SXHASH, but can instead tailor
the hash function to the specifics of your data and data structure.

I would think that this could lead to a very good hash table
implementation.  The only really difficult part of hash table design is
collision handling.  The rest is very simple to implement.

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Kaz Kylheku
Subject: Re: Hashing arrays
Date: Fri, 24 Apr 2009 20:16:46 +0000
Message-ID: <20090504213046.591@gmail.com>

On 2009-04-24, Thomas A. Russ <···@sevak.isi.edu> wrote:
> Tamas K Papp <······@gmail.com> writes:
>
>> 3) As a last resort, you can implement your own hash tables, and you
>> can make them as fast as your implementation's.
>
> This may be a pretty good solution, and perhaps not necessarily a last
> resort.

Reinventing the wheel is not only a great first resort, but for some
programmers it's a like a five star resort in a tropical paradise.

:)

From: ······@nhplace.com
Subject: Re: Hashing arrays
Date: Wed, 22 Apr 2009 16:22:56 +0000
Message-ID: <94a77a38-f26b-4f6e-9a37-9145ff21c75d@d14g2000yql.googlegroups.com>

On Apr 21, 10:13 pm, Madhu <·······@meer.net> wrote:
> * Waldek Hebisch <············@z-news.wcss.wroc.pl> :
> Wrote on Tue, 21 Apr 2009 19:01:11 +0000 (UTC):
>
> | I have noticed that in all Lisp implementations that I could
> | try (clisp, Closure CL, ecl, gcl, Poplog Clisp, sbcl) value
> | of sxhash applied to an array apparently does not depend on
> | content of array.  In gcl and ecl I got different values
> | for differet arrays, in other implementations I get the
> | same value for arrays of the same size.
> |
> | Below is a simple test I tried:
>
> []
>
> | This behaviour makes sxhash useless for hashing arrays, so IMHO
> | is a bug.  The question is if this behaviour is mandated by
> | the standard, or just several implementations independently
> | have the same bug?
>
> The bug may just be in your expectation of what SXHASH can do. In CL
> arrays are mutable.  The mutability of arrays has consequences for EQUAL
> hash tables --- The same array with a different contents should produce
> a different key.  Now EQUAL hash tables should consider contents anyway
> so SXHASH can at best compute a bucket where all elements have to be
> compared individually.
>
> But still, all is not well.
>
> There is, I think an important deficiency in the the GETHASH
> specification.  It does not mention anything about the copying (or
> non-copying) of keys.  While this is not an issue in dumbed down
> languages where all objects are made immutable to the programmer, again,
> in CL arrays are mutable, and if you are passing an adjustable array
> (say) to puthash (i.e. (SETF GETHASH)) without making a copy, you may be
> in for "surprises".
>
> [I thought I'd posted about this asking for KMP's clarification back in
>  2004, but I realize now it never made it to usenet.  Also my private
>  mail to KMP on some clarifications seems to have been ignored]
>
> --
> Madhu

Sorry.  I get heaps of mail and it sometimes gets buried.  Nothing
personal.

I don't disagree with your general analysis here other than to say
that the spec is not obliged to mark all caveats.  There is sufficient
information to deduce that there's a potential problem, and it is
indeed a hidden danger that it would have been kind of me to remark
about had I thought to, but I'll stop short of calling it a deficiency
in any formal sense.

The issue of mutable data is indeed problematic from the point of view
of hashing.  Lists have a similar issue if you rplaca/rplacd them, I
think.  So it's not just arrays.  I think in general it's fair to say
that when you're passing mutable structure data, you should always
document whether that data can be modified.  I think further that
since people don't and since there must be a default, people wondering
if they can modify stuff should not, and people wondering if they can
assume stuff won't be modified should not.  That means more work has
to be done than should be because there is often copying on both
sides.  But it is the most conservative safe thing.