From: Slobodan Blazeski
Subject: adjustable array  vs hash-table
Date: 
Message-ID: <aa8a230f-01c8-43ef-9932-533d499ee192@d4g2000prg.googlegroups.com>
I could use a hash-table or an adjustable array. Considering the keys
of the hash-table are equal to the indices in the array (0, 1, 2, 3, 4, ...),
the functionality is the same. So which one is more efficient at
a. Expanding when new elements are added and the container can't hold
them
b. Accessing elements via index / key.

thanks
Slobodan

From: Xah Lee
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <f5417bd5-ecb5-443d-98ba-8f9311688c34@d4g2000prg.googlegroups.com>
On Jan 30, 2:09 pm, Slobodan Blazeski <·················@gmail.com>
wrote:
> I could use a hash-table or an adjustable array. Considering the keys
> of the hash-table are equal to indices in the array 0.1.2.3.4...
> functionality is the same. So which one is more efficient at
> a. Expanding when new elements are added when container can't hold
> them
> b. Accessing elements via index / key.

---------------------------------------------
Allow me to rant.

First of all, in high-level langs there should be no such thingy/data-type
as hash-table. Since, as you observed, functionally they are just
array/list/sequence/aggregate (or whichmafuck computer “scientists”
want to call them).

As you can see, in Mathematica there is no hash table. There is just
the list, which is used for what other langs might call vector, sequence,
array, n-tuple, map, set, key'd list, hash, dictionary, or cons.

Further, another recent high-level lang has combined array and
hash into one entity, as the only entity of the sequence kinda
thing. That lang is PHP. For example, in PHP, you create such a linear
sequence like this:

array("a","b","c")

where the lang actually creates a hash, with keys being 0, 1, 2. The
above is equivalent to

array(0=>"a",1=>"b",2=>"c")

But what if you want a hash with different keys? Just code it so:

array("some"=>"a","key2"=>"b","anotherkey"=>"c")

The merging of hash and array into one single language entity has some
peculiar effects. For example, if you have a list and you remove an item
in the middle, you actually end up with a hash whose keys skip an
integer. So you can no longer go thru the list by incrementing an index
such as i++. So you effectively need to “reindex” it if you intend to
use a for loop over it. (In PHP, there are other lang features that
address this, of course. But if you really just want to re-index it,
just copy it to create another array.)

Whether this merging at the lang level is good or bad, the important
thing to remember is that all these vector, sequence, array, n-tuple,
map, set, key'd list, hash, dictionary... are mathematically just the
same in the context of a programing lang. Their difference is merely in
how programers want to use them, and in their time-complexity properties
under those uses. The latter is due to hardware implementation/
pragmatics/speed issues.

In a lang that has just the list, you can implement any of these
“behind-the-scenes” things on top. For example:

• vector ⇒ never add or delete an element.

• sequence ⇒ just create a list.

• array ⇒ depending on what lang you are using, this is the same as
sequence, list, or vector.

• n-tuple ⇒ like vector; just create your list with n elements and
don't insert or delete elements.

• map ⇒ depending on the lang, this may be just a list, or a hash
table.

• hashtable ⇒ a list whose elements are lists, each containing 2
elements.

• key'd list ⇒ hashtable.

• dictionary ⇒ hashtable.

• set ⇒ a list. Don't ever add an item that's already there.

• cons ⇒ a list of 2 elements; nest them into a binary tree, with the
last element of the last node being a special entity (i.e. nil).

-----------------------------

As time marches on and computing hardware resources grow exponentially,
all langs are getting rid of these special behind-the-scenes
“types” that are necessitated by hardware/pragmatics/speed issues.

For example, in C, you have none of these. You implement them
yourself. That was about 40 years ago. In Lisp, you have essentially
just cons and hashtable, with a few language facilities on top that
make these easier to use (yet one big stupidity lisp made is to let
the cons surface to the top, with things like caadr, cddaar). In Java,
you have almost all of these built-in, together with the concept of
_Java_'s Interfaces, which helps programers see them as one single
entity. In Perl and Python, you essentially just have a list and a
hash table (the hashtable is called dictionary in Python). In PHP, you
just have one list type, which functions as both sequence and key'd
list (it is called array). In Javascript, you essentially just have
the list (may be off a bit... too lazy to look up).

The important thing to note here is that, in designing a high-level
comp lang, it is important to separate what is mathematically
meaningful in the context of a human-machine protocol for specifying
computation from the issues that are related to hardware/pragmatics/
speed.

In Mathematica, the lang is very high-level, but the price is that
programs are in general slow, and the lang, as of now, cannot be used
for tasks that need to access hardware directly (such as writing a
compiler, OS, or device driver).

A similar thing can be said for any other high-level lang, esp PHP and
Javascript (and, to a lesser degree, Perl and Python). That's why
these high-level langs are sometimes called scripting langs.

In my opinion, a solution for a general lang that's high-level is to
have declaration entities in the lang specifically for the hardware/
pragmatic/speed issues.

This idea isn't new. I think Haskell and Dylan both have some such
philosophy. (Though i don't know these langs, so can't be sure exactly
what.)

Recently i noted that in Paul Graham's new lang Arc, pre-released
today, the cons business lives! What a fucking moronicity.

-------------------------
See also:

• Jargons And High Level Languages
http://xahlee.org/jargons_high_level_lang.html

• Lisp's List Problem
http://xahlee.org/emacs/lisp_list_problem.html

• What is Expressiveness in a Computer Language
http://xahlee.org/perl-python/what_is_expresiveness.html

• The Hack of bitmask used as Boolean Parameters
http://xahlee.org/UnixResource_dir/writ/bitmask.html

  Xah
  ···@xahlee.org
∑ http://xahlee.org/

☄
From: Slobodan Blazeski
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <49316f94-215d-4b72-875e-bcad1fc41045@c23g2000hsa.googlegroups.com>
In an ideal world computers would be so fast that we could do all our
programming with the god-given singly linked list. But considering the
world is less than ideal, and there's a huge difference between a
program that answers within seconds and one that needs hours to
compute, we have to step down from our pedestal and get our hands
dirty with structures optimized for a certain job.
Maybe someday compilers will be smart enough to create highly
optimized code by themselves, but that day is not today. Mother nature
might have enormous time and/or processing power, but we're just
mortals; we have to be practical in order to survive. I always start
my coding with plain simple lists, and move to specialized data
structures only if I must. If a high-level language doesn't give me
any choice, my hands are bound. So I have to either live with slow
code or code in another language.

cheers
Slobodan
In theory, there is no difference between theory and practice. But, in
practice, there is.
-Jan L. A. van de Snepscheut
From: Slobodan Blazeski
Subject: is it true that hash-tables increase their capacity to the next prime number?
Date: 
Message-ID: <25043b98-155b-4ba9-ac70-cae50be637e8@v17g2000hsa.googlegroups.com>
Thanks to everybody for their answers. I will keep both versions, and
after I finish my implementation I will benchmark with real data and
implementations.
One more question: is it true that hash-tables increase their capacity
to the next prime number?

cheers
Slobodan
From: Kaz Kylheku
Subject: Re: is it true that hash-tables increase their capacity to the next prime number?
Date: 
Message-ID: <31c8a33c-c794-4b43-bd47-f7ff87faf68b@q39g2000hsf.googlegroups.com>
On Jan 30, 3:32 pm, Slobodan Blazeski <·················@gmail.com>
wrote:
> Thanks to everybody for their answers I will keep both versions, and
> after I finish my implementation will bencmark with real data and
> implementations.
> One more question is it true that hash-tables increase their capacity
> to the next prime number?

No. Some algorithms use power-of-two sizes.

The prime number size is associated with the open addressing technique
of handling collisions. With open addressing, the table does not
contain pointers to chains of entries. It contains the entries
themselves. When two or more keys hash to the same table entry, the
conflict is not resolved by starting a linked list, but by allocating
another entry at some displaced location within the table. The probing
search continues around the table until a free entry is found, or the
entire table has been scanned without finding one. Of course, the same
search is used when looking up an existing key.
``tombstone'' markers; they can't be marked as empty entries, because
that would falsely terminate searches for valid keys that require
further probing.

If the table has a prime number P for its size, you can use an
increment other than 1 to search for empty slots. All integers in the
range 2..P-1 are relatively prime to P, since P is prime. This means
that searching around the table by any of these increments (modulo P)
guarantees that you will eventually visit all entries in the
table. For instance, suppose the table has size 7 and you use an
increment of 3. You will hit all of the residues modulo 7: 0, 3, 6,
2, 5, 1, 4. Suppose the table had size 8 and you used increment 3.
That would still work, since 8 and 3 are relatively prime. But an
increment of 2 would only hit half the entries! If the hash table has
size 7, any increment (modulo 7) will work, as long as it's not
congruent to 0 (modulo 7). This is linear probing. There is also
quadratic probing, where a quadratic function is used to scan through
the entries: for instance hash(key) + i^2, for i = 0, 1, 2, ... A
prime table size allows quadratic probing to also visit all of the
nodes.
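
The coprimality argument above is easy to check with a short sketch
(Python here, purely for illustration; `probe_sequence` is a made-up
helper, not part of any library):

```python
def probe_sequence(size, start, step):
    """Slots visited by probing with a fixed increment, starting at `start`."""
    slots = []
    slot = start
    for _ in range(size):
        slots.append(slot)
        slot = (slot + step) % size
    return slots

# Prime size 7: every increment 1..6 is coprime to 7, so all slots get visited.
print(probe_sequence(7, 0, 3))               # [0, 3, 6, 2, 5, 1, 4]
# Size 8 with increment 3 also covers everything, since gcd(8, 3) = 1 ...
print(sorted(probe_sequence(8, 0, 3)))       # [0, 1, 2, 3, 4, 5, 6, 7]
# ... but increment 2 visits only half the slots, since gcd(8, 2) = 2.
print(sorted(set(probe_sequence(8, 0, 2))))  # [0, 2, 4, 6]
```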

When collisions are resolved by chaining, using powers of two for the
table sizes has some overwhelming advantages. For instance, reducing a
hash to modulo the current table size is just a bitwise AND with the
current bitmask. When such a table has to grow, each chain's items
will split into exactly two chains, so relocating the items is easy. A
new bit is exposed in each of their hashes, and it can be 0 or 1, which
determines where the item will move. You don't have to recompute the
hashing function for anything in the table. Just look at their hashes
through a bit mask, and do some list operations.
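
That splitting step can be sketched like so (Python, for illustration
only; the identity hash and the `split_buckets` helper are assumptions
of the sketch, not any particular implementation):

```python
def split_buckets(table, hash_fn):
    """Double a power-of-two chained table; each bucket splits into two."""
    old_size = len(table)
    new_table = [[] for _ in range(2 * old_size)]
    for i, bucket in enumerate(table):
        for key in bucket:
            # The newly exposed hash bit picks slot i or slot i + old_size;
            # no hash needs to be recomputed from scratch.
            if hash_fn(key) & old_size:
                new_table[i + old_size].append(key)
            else:
                new_table[i].append(key)
    return new_table

# Identity "hash" for illustration: keys 0..15 in a size-4 table.
table = [[] for _ in range(4)]
for k in range(16):
    table[k & 3].append(k)

table = split_buckets(table, lambda k: k)
# Every key k now sits in bucket k & 7, found by masking one more bit.
assert all(k & 7 == i for i, b in enumerate(table) for k in b)
```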
From: Harald Hanche-Olsen
Subject: Re: is it true that hash-tables increase their capacity to the next prime number?
Date: 
Message-ID: <pcor6fxst9g.fsf@shuttle.math.ntnu.no>
+ Kaz Kylheku <········@gmail.com>:

> There is also quadratic probing, where a quadratic function is used
> to scan through the entries. For instance hash(key) + i^2, for i =
> 0, 1, 2, ...  A prime table size allows quadratic probing to also
> visit all of the nodes.

That doesn't sound right to me.  You should get only half of the
nodes, since squaring modulo P is a two-to-one map.  More precisely,
n^2 is congruent to (P-n)^2 modulo P.
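
The two-to-one claim is easy to verify numerically (a quick Python
sketch, for illustration):

```python
P = 7
# Offsets reachable by quadratic probing are the squares modulo P.
residues = {i * i % P for i in range(P)}
print(sorted(residues))  # [0, 1, 2, 4]
# Only (P + 1) // 2 distinct values: i^2 and (P - i)^2 collide mod P.
assert len(residues) == (P + 1) // 2
```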

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: Gene
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <1f383834-a5f7-4e84-9b66-cea2ea744dd3@v17g2000hsa.googlegroups.com>
On Jan 30, 6:01 pm, Xah Lee <····@xahlee.org> wrote:
> On Jan 30, 2:09 pm, Slobodan Blazeski <·················@gmail.com>
> wrote:
> Allow me to rant.
>
> First of all, in high-level langs there should be no such thingy/data-
> type as hash-table. Since, as you observed, functionally they are just
> array/list/sequence/aggregate (or whichmafuck computer "scientists"
> wants to call them).

Ok.  I hope you feel better.  We don't need your profanity.

Your post is rife with errors--too many to list.  Arrays, lists,
strings, bitmaps, vectors, matrices, tuples, sequences, aggregates
(i.e. structs), hash tables, maps, and dictionaries are all
specialized realizations of _functions_ with varying performance
characteristics.  Languages offer different combinations and
variations depending on their goals.
From: Xah Lee
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <f745a1a5-2e91-4541-98b3-10900e93196f@d21g2000prf.googlegroups.com>
Gene <············@gmail.com> wrote:

«...all specialized realizations of _functions_ with varying
performance characteristics.  Languages offer different combinations
and variations depending on their goals.»

Dear Gene moron,

are you just repeating what i wrote?

i no unstand, why there are so many idiots on newsgroup. Many are
students, many are plainly stupid. But my god, there are just too
many. Many are due to social ignorance. Many are due to testicular
ballisticality.

What is the cause of your idiocy?

  Xah
  ···@xahlee.org
∑ http://xahlee.org/

☄
From: George Neuner
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <dqb4q39b7b1m24soq08ihu46aqg57fgonf@4ax.com>
On Thu, 31 Jan 2008 03:29:15 -0800 (PST), Xah Lee <···@xahlee.org>
wrote:

>Gene <············@gmail.com> wrote:
>
>«...all specialized realizations of _functions_ with varying
>performance characteristics.  Languages offer different combinations
>and variations depending on their goals.»
>
>Dear Gene moron,
>
>are you just repeating what i wrote?
>
>i no unstand, why there are so many idiots on newsgroup. Many are
>students, many are plainly stupid. But my god, there are just too
>many. Many are due to social ignorance. Many are due to testicular
>ballisticality.
>
>What is the cause of your idiocy?
>
>  Xah
>  ···@xahlee.org
>∑ http://xahlee.org/
>
>☄

The only truly ignorant person here is you.

George
--
for email reply remove "/" from address
From: John <····@csHAXplan9.rit.edu>
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <47a23116$0$90269$14726298@news.sunsite.dk>
George Neuner wrote:
>On Thu, 31 Jan 2008 03:29:15 -0800 (PST), Xah Lee <···@xahlee.org>
>wrote:
>
>>Gene <············@gmail.com> wrote:
>>
>>«...all specialized realizations of _functions_ with varying
>>performance characteristics.  Languages offer different combinations
>>and variations depending on their goals.»
>>
>>Dear Gene moron,
>>
>>are you just repeating what i wrote?
>>
>>i no unstand, why there are so many idiots on newsgroup. Many are
>>students, many are plainly stupid. But my god, there are just too
>>many. Many are due to social ignorance. Many are due to testicular
>>ballisticality.
>>
>>What is the cause of your idiocy?
>>
>>  Xah
>>  ···@xahlee.org
>>∑ http://xahlee.org/
>>
>>☄
>
>The only truly ignorant person here is you.
>
>George
>--

As a man much more eloquent than this Xah Lee idiot^Wgentleman
once said, "It is a tale told by an idiot, full of sound and fury; 
signifying nothing"


John
--
From: vanekl
Subject: Re: adjustable array vs hash-table
Date: 
Message-ID: <689e905e-2a43-4009-8311-b34d378edabe@p69g2000hsa.googlegroups.com>
On Jan 30, 10:09 pm, Slobodan Blazeski <·················@gmail.com>
wrote:
> I could use a hash-table or an adjustable array. Considering the keys
> of the hash-table are equal to indices in the array 0.1.2.3.4...
> functionality is the same. So which one is more efficient at
> a. Expanding when new elements are added when container can't hold
> them
> b. Accessing elements via index / key.
>
> thanks
> Slobodan

Adjustable arrays should prove faster, and save memory (unless a
minimal perfect hash function is used).
1. If you can preallocate the expected number of array slots when the
array is first created, you minimize memory reallocations.
2. If your keys are really 0, 1, 2, 3, ..., then it sounds like you can
get around having to call a hash function (which saves time).
3. Arrays are a little more cache-friendly than hash tables, because
the data is more localized/dense than when a bunch of pointers is used
to create the hash table.
4. Side benefit: arrays give you some sort of ordering that standard
hash tables do not provide.
Lou
From: Pascal Costanza
Subject: Re: adjustable array  vs hash-table
Date: 
Message-ID: <60cbcfF1penq4U1@mid.individual.net>
Slobodan Blazeski wrote:
> I could use a hash-table or an adjustable array. Considering the keys
> of the hash-table are equal to indices in the array 0.1.2.3.4...
> functionality is the same. So which one is more efficient at
> a. Expanding when new elements are added when container can't hold
> them
> b. Accessing elements via index / key.

Hard to tell, because this is probably implementation-dependent. But 
writing a few simple benchmarks shouldn't be too hard. The predefined 
'time macro is your friend. (Put the benchmark in a function, and time a 
simple call to that function. Putting more complex code inside an 
invocation of time seems to have a negative effect on results.)


Pascal

-- 
1st European Lisp Symposium (ELS'08)
http://prog.vub.ac.be/~pcostanza/els08/

My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: Jeronimo Pellegrini
Subject: Re: adjustable array  vs hash-table
Date: 
Message-ID: <fnr10a$32f$1@aioe.org>
On 2008-01-30, Pascal Costanza <··@p-cos.net> wrote:
> Slobodan Blazeski wrote:
>> I could use a hash-table or an adjustable array. Considering the keys
>> of the hash-table are equal to indices in the array 0.1.2.3.4...
>> functionality is the same. So which one is more efficient at
>> a. Expanding when new elements are added when container can't hold
>> them
>> b. Accessing elements via index / key.
>
> Hard to tell, because this is probably implementation-dependent.

I have made a little experiment recently... It really is 
implementation-dependent, and the differences between implementations
are striking.

http://aleph0.info/spartns/

The table shows simple benchmarks. In ABCL/Clisp, use hashtables. In
SBCL/GCL, use non-adjustable arrays. I didn't test adjustable arrays...

The file benchmark.lisp has some simple functions...

To the OP: if you install Spartns, you can then change the first
function (the one that works on plain arrays) so the arrays are
adjustable, and see the results.
I did that for SBCL, and it was still 5 times faster than hashtables.
The results will certainly be different for other implementations.

J.
From: Joost Diepenmaat
Subject: Re: adjustable array  vs hash-table
Date: 
Message-ID: <871w7y516m.fsf@zeekat.nl>
Slobodan Blazeski <·················@gmail.com> writes:

> I could use a hash-table or an adjustable array. Considering the keys
> of the hash-table are equal to indices in the array 0.1.2.3.4...

You may consider binary trees, too, if your indices are sparse (though
from your description, I'm assuming they're not).

> functionality is the same. So which one is more efficient at
> a. Expanding when new elements are added when container can't hold
> them

This probably depends on the implementation, but a natural
implementation of hash tables will be an array with some kind of
linked list for the buckets. That means growing a hash table should
have about the same overhead as growing an array, plus the overhead of
redistributing the elements over the buckets. So hash tables are
probably slower to grow. Whether that really makes a difference depends
on how many times you're actually growing.
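
That "array of buckets, grow plus redistribute" shape can be sketched
minimally (Python, for illustration; `ChainedTable` and its growth
policy are assumptions of the sketch, not how any Lisp implements
hash-tables):

```python
class ChainedTable:
    """Toy chained hash table: an array of buckets, doubled when full."""
    def __init__(self):
        self.buckets = [[] for _ in range(4)]
        self.count = 0

    def put(self, key, value):
        if self.count >= len(self.buckets):       # grow like an array would...
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for bucket in old:                    # ...plus redistribute entries
                for k, v in bucket:
                    self.buckets[hash(k) % len(self.buckets)].append((k, v))
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable()
for i in range(100):
    t.put(i, i * i)
print(t.get(9))  # 81
```

The redistribution loop inside `put` is exactly the extra cost Joost
describes that a plain growable array does not pay.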

> b. Accessing elements via index / key.

Hash tables will never be faster than arrays, and arrays are probably a
lot easier to optimize for homogeneous simple values too.

Joost.
From: Jeronimo Pellegrini
Subject: Re: adjustable array  vs hash-table
Date: 
Message-ID: <fnr2dr$6s9$1@aioe.org>
On 2008-01-30, Joost Diepenmaat <·····@zeekat.nl> wrote:
>> b. Accessing elements via index / key.
>
> Hash tables will never be faster than arrays, and arrays are probably a
> lot easier to optimize for homogenous simple values too.

To my surprise, I found that in both ABCL and CLisp hash tables were
faster than arrays. This is probably because these implementations don't
compile array access as efficiently as the others.

(See my other post in this thread)

J.
From: Jeronimo Pellegrini
Subject: Re: adjustable array  vs hash-table
Date: 
Message-ID: <fnr2pc$6s9$2@aioe.org>
On 2008-01-30, Jeronimo Pellegrini <···@aleph0.info> wrote:
> On 2008-01-30, Joost Diepenmaat <·····@zeekat.nl> wrote:
>>> b. Accessing elements via index / key.
>>
>> Hash tables will never be faster than arrays, and arrays are probably a
>> lot easier to optimize for homogenous simple values too.
>
> To my surprise, I found that in both ABCL and CLisp hash tables were
> faster than arrays. This is probably because these implementations don't
> compile array access as efficiently as the others.
>
> (See my other post in this thread)

Oops. Sorry. I meant "compressed vectors, as I have implemented them" --
not plain arrays.

Arrays are faster in CLisp and ABCL.

J.