suitable index for near matches of strings?

From: Christian Nyb�
Subject: suitable index for near matches of strings?
Date: Tue, 26 Sep 2000 00:00:00 +0000
Message-ID: <874s33p1ja.fsf@lapchr.siteloft.com>

Hi,

I'm looking for a suitable way of indexing a particular slot in a list
of structs.  The slot has a string as its value.  As I'd like to be
able to search for a substring that matches from the beginning of the
string, I'm not sure whether a hashtable of the strings is a usable
data structure, as the time used to maphash across the list of structs
should grow proportinally to the length of the list.  One option that
I've tried is to populate a hash table with all possible matches for a
particular name, that's fast for small data sets, but the index gets
too big for the data I'd like to search.

For exact matches, a binary search tree or a hash work great, but how
should I arrange the index in order to search for substrings?

Both pointers to literature as well as descriptions of ways to do this
will be appreciated.
-- 
chr

Re: suitable index for near matches of strings? Barry Margolin
- Re: suitable index for near matches of strings? Erik Naggum
  - Re: suitable index for near matches of strings? Russell Senior
    - Re: suitable index for near matches of strings? Reini Urban
    - Re: suitable index for near matches of strings? Boris Schaefer
  - Re: suitable index for near matches of strings? Boris Schaefer
Re: suitable index for near matches of strings? ·····@alum.mit.edu
Re: suitable index for near matches of strings? Fernando Rodr�guez

From: Barry Margolin
Subject: Re: suitable index for near matches of strings?
Date: Tue, 26 Sep 2000 00:00:00 +0000
Message-ID: <_84A5.20$aT.541@burlma1-snr2>

In article <··············@lapchr.siteloft.com>,
Christian Nyb� <·····@eunet.no> wrote:
>Hi,
>
>I'm looking for a suitable way of indexing a particular slot in a list
>of structs.  The slot has a string as its value.  As I'd like to be
>able to search for a substring that matches from the beginning of the
>string, I'm not sure whether a hashtable of the strings is a usable
>data structure, as the time used to maphash across the list of structs
>should grow proportinally to the length of the list.  One option that
>I've tried is to populate a hash table with all possible matches for a
>particular name, that's fast for small data sets, but the index gets
>too big for the data I'd like to search.

A good place to start would probably be Knuth's "Sorting and Searching"
book.  Look for data structures like B-trees, B*-trees, prefix trees, etc.

Also, check out literature for database system implementation, as many of
them make use of these for their indices, and Knuth is a bit out of date.

-- 
Barry Margolin, ······@genuity.net
Genuity, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Erik Naggum
Subject: Re: suitable index for near matches of strings?
Date: Sat, 07 Oct 2000 00:00:00 +0000
Message-ID: <3179934072768685@naggum.net>

* Barry Margolin <······@genuity.net>
| A good place to start would probably be Knuth's "Sorting and Searching"
| book.  Look for data structures like B-trees, B*-trees, prefix trees, etc.
| 
| Also, check out literature for database system implementation, as many of
| them make use of these for their indices, and Knuth is a bit out of date.

  Knuth's TAOCP has recently been revised, and Sorting and Searching
  is in the second edition, published 1998.  Is this still out of date?

#:Erik
-- 
  If this is not what you expected, please alter your expectations.

From: Russell Senior
Subject: Re: suitable index for near matches of strings?
Date: Sat, 07 Oct 2000 00:00:00 +0000
Message-ID: <86wvfkynmm.fsf@coulee.tdb.com>

>>>>> "Erik" == Erik Naggum <····@naggum.net> writes:

Erik>   Knuth's TAOCP has recently been revised, and Sorting and
Erik> Searching is in the second edition, published 1998.  Is this
Erik> still out of date?

Here is a data point (or some fraction of one) ...

Do not consider this definitive, but having recently purchased the 3rd
edition of vol2, my shallow impression is that the revision was
performed with a fairly light brush.  Here is an excerpt from the
preface:

  [...] The _Art of Computer Programming_ is, however, still a work in
  progress.  Research on seminumerical algorithms continues to grow at
  a phenomenal rate.  Therefore some parts of this book are headed by
  an ``under construction'' icon, to apologize for the fact that the
  material is not up-to-date. [...]

I have no idea if this remark applies to the other volumes.

-- 
Russell Senior         ``The two chiefs turned to each other.        
·······@aracnet.com      Bellison uncorked a flood of horrible       
                         profanity, which, translated meant, `This is
                         extremely unusual.' ''

From: Reini Urban
Subject: Re: suitable index for near matches of strings?
Date: Mon, 09 Oct 2000 00:00:00 +0000
Message-ID: <39e1f110.787306406@judy>

>>>>> "Erik" == Erik Naggum <····@naggum.net> writes:
Erik>   Knuth's TAOCP has recently been revised, and Sorting and
Erik> Searching is in the second edition, published 1998.  Is this
Erik> still out of date?

Russell Senior wrote:
>Do not consider this definitive, but having recently purchased the 3rd
>edition of vol2, 

me as well. but erik mentioned vol3 (2), not vol2 (3).

>my shallow impression is that the revision was
>performed with a fairly light brush.  Here is an excerpt from the
>preface:
>
>  [...] The _Art of Computer Programming_ is, however, still a work in
>  progress.  Research on seminumerical algorithms continues to grow at
>  a phenomenal rate.  Therefore some parts of this book are headed by
>  an ``under construction'' icon, to apologize for the fact that the
>  material is not up-to-date. [...]
>
>I have no idea if this remark applies to the other volumes.

IMHO vol2 (seminumerical algorithms) is the most up-to-date one (but i'm
no numeric specialist), 
vol1 VERY far behind (too small for a complete volume anyway), 
vol3 in some parts behind, but okay for most cases.
for vol1 exist some updates for the new pseudo MMIX cpu at his website. 
this is quite modern. not transmeta, but similar.

parts of vol4 (combinatorics, graphs, ai and game stuff) are currently
in beta. this would be the most interesting, but I doubt if he can
represent it with MMIX-only enjoyable enough. I would rather read norvig
twice.
-- 
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html

From: Boris Schaefer
Subject: Re: suitable index for near matches of strings?
Date: Sat, 14 Oct 2000 00:00:00 +0000
Message-ID: <87g0lzzmtf.fsf@qiwi.uncommon-sense.net>

Russell Senior <·······@aracnet.com> writes:

| Do not consider this definitive, but having recently purchased the 3rd
| edition of vol2, my shallow impression is that the revision was
| performed with a fairly light brush.  Here is an excerpt from the
| preface:
| 
|   [...] The _Art of Computer Programming_ is, however, still a work in
|   progress.  Research on seminumerical algorithms continues to grow at
|   a phenomenal rate.  Therefore some parts of this book are headed by
|   an ``under construction'' icon, to apologize for the fact that the
|   material is not up-to-date. [...]
| 
| I have no idea if this remark applies to the other volumes.

Yes, it applies to the new editions of all volumes.  I don't know the
older editions, but I remember seeing on Knuth's web page that after
he has finished vol.4, he wants to do the grand and final revision of
the first three volumes (in some 15 years or so, I think he said).

-- 
·····@uncommon-sense.net - <http://www.uncommon-sense.net/>

Two wrights don't make a rong, they make an airplane.  Or bicycles.

From: Boris Schaefer
Subject: Re: suitable index for near matches of strings?
Date: Sat, 14 Oct 2000 00:00:00 +0000
Message-ID: <87k8bbzn19.fsf@qiwi.uncommon-sense.net>

Erik Naggum <····@naggum.net> writes:

| Knuth's TAOCP has recently been revised, and Sorting and Searching
| is in the second edition, published 1998.  Is this still out of
| date?

Yes, a bit.  Take a look at pages 475-478 (the end of section 6.2.3)
to see some things that are not covered in the section about binary
search trees.  Some more notable missings are red-black trees, splay
trees, skip lists and treaps.  At the end of p.478 Knuth says that he
plans to include the following in the next edition: skip lists,
randomized binary search trees, and treaps.

Probably any book about data structures and algorithms is a) out of
date, and b) incomplete.  Even though not entirely up to date TAOCP is
still very good, and probably the most comprehensive book available.
Although, I don't know Cormen's Introduction to Algorithms, which is
supposed to be good as well.

-- 
·····@uncommon-sense.net - <http://www.uncommon-sense.net/>

If you know the answer to a question, don't ask.
		-- Petersen Nesbit

From: ·····@alum.mit.edu
Subject: Re: suitable index for near matches of strings?
Date: Tue, 26 Sep 2000 00:00:00 +0000
Message-ID: <m23dimzvat.fsf@cadet.dsl.speakeasy.net>

·····@eunet.no (Christian Nyb�) writes:

> I'm looking for a suitable way of indexing a particular slot in a
> list of structs.  The slot has a string as its value.  As I'd like
> to be able to search for a substring that matches from the beginning
> of the string, ...

I'm not sure I follow exactly, but I've used the ``trie'' data
structure before, and it's easy enough to implement in CL, though you
will find existing implementations available out there.

One such implementation of tries is defined inside Xerox Parc's POS
tagger, and it's a complex `define-***'-style macro that gives some
options and defines all sorts of accessors and stuff for your trie.
This particular data structure makes excellent reuse of data when
dealing with large lexicons, and when searching starting at the
beginning of the string, it's ideal.

It's also a very intuitive data structure.

dave

From: Fernando Rodr�guez
Subject: Re: suitable index for near matches of strings?
Date: Sun, 08 Oct 2000 00:00:00 +0000
Message-ID: <ack0usoo15il60e3t7el8m9etiurommla8@4ax.com>

On 26 Sep 2000 18:21:13 +0200, ·····@eunet.no (Christian Nyb�) wrote:

>Hi,
>
>I'm looking for a suitable way of indexing a particular slot in a list
>of structs.  The slot has a string as its value.  As I'd like to be
>able to search for a substring that matches from the beginning of the
>string, I'm not sure whether a hashtable of the strings is a usable
>data structure, as the time used to maphash across the list of structs

A hash table will only work if the number of strings is very small.
You should consider taking a look at "tries" and PAT trees.  Probably
a trie is the easiest solution.  

A very good book about this sort of stuff is "Information Retrieval"
by Baeza-Yates. It also comes with samples, though in C.

Good luck,

//-----------------------------------------------
//	Fernando Rodriguez Romero
//
//	frr at mindless dot com
//------------------------------------------------