From: Greg Menke
Subject: Relative efficiencies/costs
Date: 
Message-ID: <m3wv7prrkt.fsf@europa.mindspring.com>
I have a hash table of lists, keyed by fixnums.  Each of the lists
contains some quantity of arbitrary data- they could be turned into
structs or classes- but they're lists at the moment.  Elsewhere I have
a list that contains only fixnums, each of which relates to a key in
the hashtable.  I traverse the list, using the fixnums to get the
right lists from the hash table.  Once generated, this list of indices
is static, but random access is not needed- only sequential, starting
from the head.
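
For concreteness, the shape of the thing is roughly this (a sketch,
names made up):

```lisp
;; Dictionary keyed by fixnums, plus a static key list that is
;; traversed sequentially.
(defparameter *my-hash-table* (make-hash-table))

(setf (gethash 1 *my-hash-table*) '(foo bar)
      (gethash 2 *my-hash-table*) '(baz))

(defparameter *key-list* '(1 2 1))

;; Every step of the traversal indirects through the hash table.
(defun traverse (key-list)
  (mapcar (lambda (k) (gethash k *my-hash-table*)) key-list))
```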

My specific question is would it be "better" to use references to the
correct lists rather than indirecting thru the hash table?  I was
thinking "better" might be measured in memory consumption.  The
reference approach would certainly execute faster.

I'm also wondering if "bind" would be a better word in this case than
"reference"...

Thanks,

Gregm

From: Barry Margolin
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <rBBK6.17$oj7.481@burlma1-snr2>
In article <··············@europa.mindspring.com>,
Greg Menke  <··········@mindspring.com> wrote:
>
>I have a hash table of lists, keyed by fixnums.  Each of the lists
>contains some quantity of arbitrary data- they could be turned into
>structs or classes- but they're lists at the moment.  Elsewhere I have
>a list that contains only fixnums, each of which relates to a key in
>the hashtable.  I traverse the list, using the fixnums to get the
>right lists from the hash table.  Once generated, this list of indices
>is static, but random access is not needed- only sequential, starting
>from the head.
>
>My specific question is would it be "better" to use references to the
>correct lists rather than indirecting thru the hash table?  I was
>thinking "better" might be measured in memory consumption.  The
>reference approach would certainly execute faster.

To do this, you would have to make a new association list, e.g.

(defun make-my-alist (key-list)
  (pairlis key-list
           (mapcar #'(lambda (x) (gethash x *my-hash-table*))
                   key-list)))

This requires allocating memory for this list, so it will almost certainly
use more memory than going through the hash table every time.  It will
allocate two cons cells for each element of the key-list.  But unless
key-list is huge, this probably won't be significant; and if key-list is
huge, the time savings from not going through the hash table may be
significant.  Stepping through an alist sequentially is quite easy, so if
you're going to be doing this repeatedly, I would probably recommend making
the alist rather than going through the hash table.

But the right thing to do is implement your code both ways and benchmark
them.
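
One wrinkle: the spec leaves the order of PAIRLIS's result
unspecified, so if traversal order matters, a MAPCAR/CONS sketch keeps
the keys in sequence:

```lisp
;; Build the alist in key order (PAIRLIS may return pairs in any order).
(defun make-my-alist (key-list table)
  (mapcar (lambda (k) (cons k (gethash k table))) key-list))

;; Sequential traversal then touches one cons pair per step, no hashing.
(defun walk-alist (alist fn)
  (dolist (pair alist)
    (funcall fn (car pair) (cdr pair))))
```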

-- 
Barry Margolin, ······@genuity.net
Genuity, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Greg Menke
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <m3itj9atvb.fsf@europa.mindspring.com>
> >My specific question is would it be "better" to use references to the
> >correct lists rather than indirecting thru the hash table?  I was
> >thinking "better" might be measured in memory consumption.  The
> >reference approach would certainly execute faster.
> 
> To do this, you would have to make a new association list, e.g.
> 
> (defun make-my-alist (key-list)
>   (pairlis key-list
>            (mapcar #'(lambda (x) (gethash x *my-hash-table*))
>                    key-list)))
> 
> This requires allocating memory for this list, so it will almost certainly
> use more memory than going through the hash table every time.  It will
> allocate two cons cells for each element of the key-list.  But unless
> key-list is huge, this probably won't be significant; and if key-list is
> huge, the time savings from not going through the hash table may be
> significant.  Stepping through an alist sequentially is quite easy, so if
> you're going to be doing this repeatedly, I would probably recommend making
> the alist rather than going through the hash table.

If I used the binding approach, I could dispense with the hashtable
altogether.  If so, is there a reason why the alist is needed?

The general question I forgot to ask was, is there an advantage with
respect to memory consumption by using a fixnum rather than binding to
an entry in some other structure?


> But the right thing to do is implement your code both ways and benchmark
> them.

I'm nearly at the point where I can do so... it'll give me a chance to
try the lw profiler too...

Gregm
From: Barry Margolin
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <PaCK6.20$oj7.687@burlma1-snr2>
In article <··············@europa.mindspring.com>,
Greg Menke  <··········@mindspring.com> wrote:
>
>> >My specific question is would it be "better" to use references to the
>> >correct lists rather than indirecting thru the hash table?  I was
>> >thinking "better" might be measured in memory consumption.  The
>> >reference approach would certainly execute faster.
>> 
>> To do this, you would have to make a new association list, e.g.
>> 
>> (defun make-my-alist (key-list)
>>   (pairlis key-list
>>            (mapcar #'(lambda (x) (gethash x *my-hash-table*))
>>                    key-list)))
>> 
>> This requires allocating memory for this list, so it will almost certainly
>> use more memory than going through the hash table every time.  It will
>> allocate two cons cells for each element of the key-list.  But unless
>> key-list is huge, this probably won't be significant; and if key-list is
>> huge, the time savings from not going through the hash table may be
>> significant.  Stepping through an alist sequentially is quite easy, so if
>> you're going to be doing this repeatedly, I would probably recommend making
>> the alist rather than going through the hash table.
>
>If I used the binding approach, I could dispense with the hashtable

I assumed that the alist would just be temporary, i.e. you have some data
in the hash table, you have a list containing a subset of the keys, and for
a while you just need to work with that subset.  After you're done doing
that, you would then process a different subset.  If this isn't the case, I
don't understand why you have the hash table and the list of numbers in the
first place.

>altogether.  If so, is there a reason why the alist is needed?

The alist associates the keys with the lists, just like the hash table
does.  How else were you planning on determining which list goes with which
key?

>The general question I forgot to ask was, is there an advantage with
>respect to memory consumption by using a fixnum rather than binding to
>an entry in some other structure?

Both structures associate the fixnum with the list, they just provide
different ways of searching.  Hash tables are good for random access,
alists are good for sequential processing.

BTW, I forgot to mention previously that hash tables typically have quite a
bit of memory overhead.  For hashing to be efficient, the array that's used
to implement the hash table should not be too full, and IIRC it's optimal
when it's only about 30% full.  So the hash table will probably consume
significantly more memory than the list.
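
You can hint at this at creation time, though both parameters are only
advisory and implementations vary (a sketch):

```lisp
;; :size hints at the expected number of entries; :rehash-threshold
;; says how full the table may get before it grows.  Implementations
;; may round or ignore these.
(defparameter *table*
  (make-hash-table :size 1000 :rehash-threshold 0.7))
```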

-- 
Barry Margolin, ······@genuity.net
Genuity, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Greg Menke
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <m3snicoc5q.fsf@europa.mindspring.com>
> I assumed that the alist would just be temporary, i.e. you have some data
> in the hash table, you have a list containing a subset of the keys, and for
> a while you just need to work with that subset.  After you're done doing
> that, you would then process a different subset.  If this isn't the case, I
> don't understand why you have the hash table and the list of numbers in the
> first place.

The hashtable is acting as a dictionary; there are multiple subsets.
My theory was that a list of fixnums keying a hashtable would be somehow
more efficient/better than a list of bindings to the appropriate
entries.  Being a newbie, it was also a little easier to conceive of.

  
> The alist associates the keys with the lists, just like the hash table
> does.  How else were you planning on determining which list goes with which
> key?

By making each car of the subset list the head of the appropriate data
list.  As I build the subset list, I'd be temporarily using a hash
table to look up the data lists, which is discarded once the subset
lists are loaded.  Done that way, I could directly hash the input
tokens to the data lists without the fixnum intermediate key.

Maybe I should have done it that way in the first place...


> BTW, I forgot to mention previously that hash tables typically have quite a
> bit of memory overhead.  For hashing to be efficient, the array that's used
> to implement the hash table should not be too full, and IIRC it's optimal
> when it's only about 30% full.  So the hash table will probably consume
> significantly more memory than the list.

I know I'm loading the hashtable poorly (sequential fixnum keys, added
in order), so a better approach would be useful anyhow...

Thanks,

Gregm
From: Marco Antoniotti
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <y6cofszq5cy.fsf@octagon.mrl.nyu.edu>
Greg Menke <··········@mindspring.com> writes:

> I have a hash table of lists, keyed by fixnums.  Each of the lists
> contains some quantity of arbitrary data- they could be turned into
> structs or classes- but they're lists at the moment.  Elsewhere I have
> a list that contains only fixnums, each of which relates to a key in
> the hashtable.  I traverse the list, using the fixnums to get the
> right lists from the hash table.  Once generated, this list of indices
> is static, but random access is not needed- only sequential, starting
> from the head.

Is the list of fixnums kept sorted?  If not, you do not need it and you
can achieve the same effect by

	(loop for k being the hash-key of *table* using (hash-value v)
	      do (something-with k :and v))

> My specific question is would it be "better" to use references to the
> correct lists rather than indirecting thru the hash table?  I was
> thinking "better" might be measured in memory consumption.  The
> reference approach would certainly execute faster.
> 
> I'm also wondering if "bind" would be a better word in this case than
> "reference"...

I believe you need to abstract things a little.  Do you just have a
set of "items" (your lists) that you want to keep around and just
access them via a key, or do you want to do other operations
(maximize, keep sorted) on this set of items?

Cheers


-- 
Marco Antoniotti ========================================================
NYU Courant Bioinformatics Group	tel. +1 - 212 - 998 3488
719 Broadway 12th Floor                 fax  +1 - 212 - 995 4122
New York, NY 10003, USA			http://bioinformatics.cat.nyu.edu
	       "Hello New York! We'll do what we can!"
			Bill Murray in `Ghostbusters'.
From: Greg Menke
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <m31ypv9850.fsf@europa.mindspring.com>

> Is the list of fixnums kept sorted?  If not you do not need it and you
> can achieve the same effect by
> 
> 	(loop for k being the hash-key of *table* using (hash-value v)
> 	      do (something-with k :and v))

The ordering of fixnums in the list is important, the application is
storing a document by putting words & other character sequences in a
hashtable, keyed by fixnums.  The document text is represented by a
list of the fixnum keys.  Multiple documents share the dictionary
hashtable.  There is enough intelligence built into the key to avoid
including a key just to represent single spaces.  Contiguous
punctuation and/or whitespace is keyed and stored on the assumption it
occurs infrequently enough to not be excessively wasteful.
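
Concretely, rendering a document back out looks roughly like this
(made-up names, and glossing over the punctuation/whitespace tokens,
which carry their own keys):

```lisp
;; Shared dictionary: fixnum key -> token string.
(defparameter *dictionary* (make-hash-table))

;; A document is a list of keys; single spaces between word tokens
;; are implied rather than stored.
(defun render-document (key-list)
  (with-output-to-string (s)
    (loop for (k . rest) on key-list
          do (write-string (gethash k *dictionary*) s)
          when rest do (write-char #\Space s))))
```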

I think I could be more clever and use CLOS instead...


> 
> I believe you need to abstract things a little.  Do you just have a
> set of "items" (your lists) that you want to keep around and just
> access them via a key, or do you want to do other operations
> (maximize, keep sorted) on this set of items?

They are kept around only for lookups.

Gregm
From: Pierre R. Mai
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <87pudfpu2m.fsf@orion.bln.pmsf.de>
Greg Menke <··········@mindspring.com> writes:

> > Is the list of fixnums kept sorted?  If not you do not need it and you
> > can achieve the same effect by
> > 
> > 	(loop for k being the hash-key of *table* using (hash-value v)
> > 	      do (something-with k :and v))
> 
> The ordering of fixnums in the list is important, the application is
> storing a document by putting words & other character sequences in a
> hashtable, keyed by fixnums.  The document text is represented by a
> list of the fixnum keys.  Multiple documents share the dictionary
> hashtable.  There is enough intelligence built into the key to avoid
> including a key just to represent single spaces.  Contiguous
> punctuation and/or whitespace is keyed and stored on the assumption it
> occurs infrequently enough to not be excessively wasteful.

Hmm, this sounds a bit like (the first part of) Lempel-Ziv coding.

Are the fixnum keys so sparse that a hash-table wins over a directly
indexed array for the dictionary?

Since you seem to store some information in the key-value itself, you
probably need to use fixnums as the keys, and can't use pointers to
the phrase strings themselves, right?

Another point to consider is whether you don't really want the
documents to be (specialized) arrays, instead of lists, since long
lists are pretty wasteful both in terms of space, locality of
reference and access-time, especially when compared to specialized
arrays, like arrays of fixnums.
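
E.g. something like (a sketch):

```lisp
;; A specialized vector of fixnums: most implementations store the
;; elements unboxed, one word each, with good locality of reference.
(defparameter *doc*
  (make-array 4 :element-type 'fixnum
                :initial-contents '(17 4 17 9)))
```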

> I think I could be more clever and use CLOS instead...

Hmmm, I don't think that CLOS will buy you much here, this seems more
like an exercise in low-level representation selection, but then again
I don't know what you are doing, exactly...

> They are kept around only for lookups.

Again, if possible, it seems to me that an array of strings, where the
strings are "interned" in the shared dictionaries, and hence shared
across documents, might be the best approach, both in terms of memory
consumption and lookup time.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Greg Menke
Subject: Re: Relative efficiencies/costs
Date: 
Message-ID: <m3d79fo2kj.fsf@europa.mindspring.com>
> > The ordering of fixnums in the list is important, the application is
> > storing a document by putting words & other character sequences in a
> > hashtable, keyed by fixnums.  The document text is represented by a
> > list of the fixnum keys.  Multiple documents share the dictionary
> > hashtable.  There is enough intelligence built into the key to avoid
> > including a key just to represent single spaces.  Contiguous
> > punctuation and/or whitespace is keyed and stored on the assumption it
> > occurs infrequently enough to not be excessively wasteful.
> 
> Hmm, this sounds a bit like (the first part of) Lempel-Ziv coding.
> 
> Are the fixnum keys so sparse that a hash-table wins over a directly
> indexed array for the dictionary?
> 
> Since you seem to store some information in the key-value itself, you
> probably need to use fixnums as the keys, and can't use pointers to
> the phrase strings themselves, right?
> 
> Another point to consider is whether you don't really want the
> documents to be (specialized) arrays, instead of lists, since long
> lists are pretty wasteful both in terms of space, locality of
> reference and access-time, especially when compared to specialized
> arrays, like arrays of fixnums.

I'm not particularly trying to compress the document, just to
"tokenize" the text into something readily searchable and generate a
few different indices, wordcounts, etc.  The documents could well be
set up as specialized arrays, though the total length is not known
before reading the document.  Converting the list to an array
afterwards is something I'll probably do shortly (unless it's possible
to continually & efficiently resize an array).  This would be quite
helpful when it comes to creating a glossary...
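
Something like an adjustable array with a fill pointer might cover the
resizing, if I understand them right (a sketch):

```lisp
;; VECTOR-PUSH-EXTEND grows the array as needed, with amortized cost
;; similar to consing up a list and coercing it afterwards.
(defparameter *doc-vec*
  (make-array 0 :element-type 'fixnum
                :adjustable t :fill-pointer 0))

(vector-push-extend 42 *doc-vec*)
(vector-push-extend 7 *doc-vec*)
```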


> 
> > I think I could be more clever and use CLOS instead...
> 
> Hmmm, I don't think that CLOS will buy you much here, this seems more
> like an exercise in low-level representation selection, but then again
> I don't know what you are doing, exactly...

I just got rid of the fixnums anyhow, going to a shallow class
hierarchy; a base class representing a sequence of text, then a few
derived classes to represent a "normalized" dictionary word, a
non-word character sequence (includes numbers), and plain words.  The
plain word class has a member bound to the dictionary word it relates
to, each dictionary word has a list of different representations of
the dictionary word.

I'm still using the hash tables, but now they're keyed off the
words/char sequences themselves.  The list representing a document is
now filled with references to the words/char class instances, so
traversing a document requires no hashtable lookup at all.  It's really
much nicer now.  Running interpreted, document loading is much the
same speed as the fixnum approach- possibly faster.  I imagine
traversing the documents will be considerably faster.

TYPEP, along with using different classes for words and non-word char
sequences, still lets me get rid of the inter-word space overhead.
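
In outline, the hierarchy looks like this (class and slot names
approximate):

```lisp
(defclass text-seq ()
  ((text :initarg :text :reader seq-text)))

;; A normalized dictionary entry; VARIANTS holds the surface forms seen.
(defclass dictionary-word (text-seq)
  ((variants :initform '() :accessor word-variants)))

;; Punctuation, numbers, and whitespace runs.
(defclass non-word-seq (text-seq) ())

;; A word occurrence, bound to its dictionary entry.
(defclass plain-word (text-seq)
  ((dict-entry :initarg :dict-entry :reader word-dict-entry)))
```

Traversal can then TYPEP each element to decide whether an implied
inter-word space is needed.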


> > They are kept around only for lookups.
> 
> Again, if possible, it seems to me that an array of strings, where the
> strings are "interned" in the shared dictionaries, and hence shared
> across documents, might be the best approach, both in terms of memory
> consumption and lookup time.

That's what I'm doing; the hashtables are shared between all documents.
Per-document metadata such as a concordance, wordcounts, indices, etc.
are stored in a document class- which contains the list of words as
well.

Gregm