From: Reini Urban
Subject: string or int hashtables
Date: 
Message-ID: <3B90219F.D076A7CB@x-ray.at>
I need to search in a big word dictionary (similar to nick levin's fast-index)

Now I have two options:

A string hash-table with :test equal
or a numified hash-table with :test eql, where the number is a condensed but 
unique integer representation of the string. so sxhash is not enough.

The problem with lisp is the different and rather small number of bytes for
immediate ints.
with 32 bit I get about 15-16 chars per word, with 24 much less. 
So on any word longer than this upper limit the hash has to be upgraded to the
new type. 
I think this will be slower than an initial string hash and :test equal.
I believe that the longest word will have 34 chars.
It should run on clisp, cmucl, corman lisp and scsh.
It should run fast, size does not matter.

So I'm stuck to equal, or?
I could also make two hashes seperated on the word length. 
an immediate eq hash with ints, and an equal for the few long words. 
is this common practice or does everybody just use strings?
-- 
Reini Urban
http://xarch.tu-graz.ac.at/home/rurban/

From: Louis Theran
Subject: Re: string or int hashtables
Date: 
Message-ID: <h0kwv3j21a8.fsf@genet.cs.umass.edu>
Reini Urban <······@x-ray.at> writes:

> I need to search in a big word dictionary 
> 
  [ ... ]
> A string hash-table with :test equal
> or a numified hash-table with :test eql, where the number is a condensed but 
> unique integer representation of the string. so sxhash is not enough.

I don't see how the second is generally possible.  Anyway, my
suggestion is to just build and on-disk hash along the lines of CDB
(http://cr.yp.to/cdb/cdb/txt) if the dictionary doesn't change too
often. 

^L
From: Joe Nall
Subject: Re: string or int hashtables
Date: 
Message-ID: <3B903F69.FC6E8DF7@nall.com>
Reini Urban wrote:
> 
> I need to search in a big word dictionary (similar to nick levin's fast-index)
> 
> Now I have two options:
> 
> A string hash-table with :test equal
> or a numified hash-table with :test eql, where the number is a condensed but
> unique integer representation of the string. so sxhash is not enough.

As a relative Lisp beginner I may be missing something, but the # of
buckets in the hash table and the distribution across the buckets should
be the performance drivers in this application. The :test time is only a
factor as the bucket length grows.

joe
From: Kent M Pitman
Subject: Re: string or int hashtables
Date: 
Message-ID: <sfw7kvja62j.fsf@world.std.com>
Reini Urban <······@x-ray.at> writes:

> I need to search in a big word dictionary (similar to nick levin's fast-index)
> Now I have two options:
> 
> A string hash-table with :test equal
> or a numified hash-table with :test eql, where the number is a condensed but 
> unique integer representation of the string. so sxhash is not enough.
> 
> The problem with lisp is the different and rather small number of bytes for
> immediate ints.
> with 32 bit I get about 15-16 chars per word, with 24 much less. 
> So on any word longer than this upper limit the hash has to be upgraded to the
> new type. 
> I think this will be slower than an initial string hash and :test equal.
> I believe that the longest word will have 34 chars.
> It should run on clisp, cmucl, corman lisp and scsh.
> It should run fast, size does not matter.
> 
> So I'm stuck to equal, or?
> I could also make two hashes seperated on the word length. 
> an immediate eq hash with ints, and an equal for the few long words. 
> is this common practice or does everybody just use strings?

Maybe I'm the only one but I just don't understand the problem description
well enough to even guess how to begin looking for a solution.

I'm always weirded out by an "initial" problem description that begins 
with "I have two options".  e.g., is it clear a priori that simple binary
search in a packed table would not be in the running?

And, btw, is this a "big" "word dictionary" or a "big word" "dictionary"?
From: Tim Bradshaw
Subject: Re: string or int hashtables
Date: 
Message-ID: <nkj4rqkd1m6.fsf@davros.tardis.ed.ac.uk>
Reini Urban <······@x-ray.at> writes:

> I need to search in a big word dictionary (similar to nick levin's fast-index)
> 
> Now I have two options:
> 
> A string hash-table with :test equal
> or a numified hash-table with :test eql, where the number is a condensed but 
> unique integer representation of the string. so sxhash is not enough.
> 


As someone else said, the performance of a hashtable - if the
collisions are low, should be dominated by the time-to-hash, not the
equality test.  So if you make a suitably big hashtable you should be
doing pretty well.

But you have at *least* one more option, especially if size does not
matter (which I think you said).

(defun make-dictionary (name)
  (make-package name :use '()))

(defun intern-in-dictionary (word dict)
  (intern word dict))

(defun lookup-in-dictionary (word dict)
  (let ((f (find-symbol word dict)))
    (if f (symbol-name f) nil)))


This is typed in without access to a Lisp to test it, but something
like this should work, and probably be quick but waste a lot of space
for all the symbol slots.

There may well be other options as well.

--tim
From: Barry Margolin
Subject: Re: string or int hashtables
Date: 
Message-ID: <4K7l7.6$nH5.281@burlma1-snr2>
In article <···············@davros.tardis.ed.ac.uk>,
Tim Bradshaw  <···@tfeb.org> wrote:
>But you have at *least* one more option, especially if size does not
>matter (which I think you said).
>
>(defun make-dictionary (name)
>  (make-package name :use '()))

This seems like a pretty silly solution, since a package will usually be
implemented internally as a string hashtable.

-- 
Barry Margolin, ······@genuity.net
Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Tim Bradshaw
Subject: Re: string or int hashtables
Date: 
Message-ID: <ey3lmjuney7.fsf@cley.com>
* Barry Margolin wrote:

> This seems like a pretty silly solution, since a package will usually be
> implemented internally as a string hashtable.

I agree.  However I once did some tests which had packages
outperforming any hashtable I could make (that wasn't a package).  I
forget which implementation it was unfortunately.  Really my point was
that there are hashtable-like objects which are specialised to strings
- packages.

--tim