From: Emre Sevinc
Subject: How optimized is this string searching method?
Date: 
Message-ID: <r0l64coeccb.fsf@emba-emre.bilgi.networks>
I have a list of strings sorted alphabetically and it contains about 1500
strings (string lengths vary between 2 and 15 characters). The number of
elements of this string list may grow up to 3000-4000 strings in the future:

CG-USER(19): *alive-entities*
("abanoz" "abi" "abla" "acaip" "adam" "ahtapot" ... )

My purpose is to detect if a given string, e.g. "zebra", is one of the strings
in this list.

So I think I can simply code it as:

(find "zebra" *alive-entities* :test #'string-equal)

Is this acceptable from the performance point of view?

I have "time"d it as:

CG-USER(21): (time
              (loop for i from 1 upto 1000
                    do (find "zebra" *alive-entities* :test #'string-equal)))
; cpu time (non-gc) 157 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  157 msec user, 0 msec system
; real time  156 msec
; space allocation:
;  17,157 cons cells, 105,344 other bytes, 0 static bytes
NIL

And for 10000 searches it takes about 1.68 seconds,
with a space allocation of 170,177 cons cells and 1,041,440 other bytes.

I'd like to know if this performance is pretty awful
compared to some better way (e.g. a different data structure, etc.),
or if I have to code my own binary search, etc.

Regards,

-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http://emba.bilgi.edu.tr        http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr

From: Pascal Bourguignon
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <87k614tqip.fsf@thalassa.informatimago.com>
Emre Sevinc <·····@bilgi.edu.tr> writes:

> I have a list of strings sorted alphabetically and it contains about 1500
> strings (string lengths vary between 2 - 15 characters). The number of
> elements of this string list may grow up to 3000-4000 strings in the future:
>
> CG-USER(19): *alive-entities*
> ("abanoz" "abi" "abla" "acaip" "adam" "ahtapot" ... )
>
> My purpose is to detect if a given string, e.g. "zebra", is one of the strings
> in this list.
>
> So I think I can simply code it as:
>
> (find "zebra" *alive-entities* :test #'string-equal)
>
> Is this acceptaple from the performance point of view?
>
> I have "time"d it as:
>
> CG-USER(21): (time
>               (loop for i from 1 upto 1000
>                     do (find "zebra" *alive-entities* :test #'string-equal)))
> ; cpu time (non-gc) 157 msec user, 0 msec system
> ; cpu time (gc)     0 msec user, 0 msec system
> ; cpu time (total)  157 msec user, 0 msec system
> ; real time  156 msec
> ; space allocation:
> ;  17,157 cons cells, 105,344 other bytes, 0 static bytes
> NIL
>
> And for 10000 searches it takes about 1,68 seconds
> and space allocation of 170,177 cons cells and 1,041,440.
>
> I'd like to know if this performance is pretty awful 
> compared to some better way (e.g. different data structure, etc.
> or if I have to code my own binary search, etc.)

Yes, it's pretty bad.


Compare:

(defparameter *words*
 (with-open-file (input "/usr/share/dict/words")
   (loop :repeat 4000 :for word = (read-line input nil nil) :while word :collect word)))
(defparameter *words-h* (make-hash-table :test (function equalp)))
(loop :for word in *words* :do (setf (gethash word *words-h*) word))

(time (loop :repeat 10000 :do (find "zebra" *words* :test (function string-equal))))
(time (loop :repeat 10000 :do (gethash "zebra" *words-h*)))


C/USER[362]> (time (loop :repeat 10000 :do (find "zebra" *words* :test (function string-equal))))
Real time: 21.522385 sec.
Run time: 20.7973 sec.
Space: 4408 Bytes
NIL
C/USER[363]> (time (loop :repeat 10000 :do (gethash "zebra" *words-h*)))
Real time: 0.053166 sec.
Run time: 0.052004 sec.
Space: 4408 Bytes
NIL
C/USER[364]> 

So at least use a hash-table.
With some luck, it might even be faster than a regexp:

(defparameter *words-re* (regexp:regexp-compile 
                           (format nil "^\\(~{~A~^\\|~}\\)$" *words*)
                           :ignore-case t))

(time (loop :repeat 10000 :do (regexp:regexp-exec *words-re* "zebra")))


C/USER[388]> (time (loop :repeat 10000 :do (regexp:regexp-exec *words-re* "zebra")))
Real time: 0.08108 sec.
Run time: 0.072005 sec.
Space: 4408 Bytes
NIL


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
The rule for today:
Touch my tail, I shred your hand.
New rule tomorrow.
From: Emre Sevinc
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <r0l1wnce9as.fsf@emba-emre.bilgi.networks>
Pascal Bourguignon <···@informatimago.com> writes:

> Emre Sevinc <·····@bilgi.edu.tr> writes:
>
>> I have a list of strings sorted alphabetically and it contains about 1500
>> strings (string lengths vary between 2 - 15 characters). The number of
>> elements of this string list may grow up to 3000-4000 strings in the future:
>>
>> CG-USER(19): *alive-entities*
>> ("abanoz" "abi" "abla" "acaip" "adam" "ahtapot" ... )
>>
>> My purpose is to detect if a given string, e.g. "zebra", is one of the strings
>> in this list.
>>
>> So I think I can simply code it as:
>>
>> (find "zebra" *alive-entities* :test #'string-equal)
>>
>> Is this acceptaple from the performance point of view?
>>
>> And for 10000 searches it takes about 1,68 seconds
>> and space allocation of 170,177 cons cells and 1,041,440.
>>
>> I'd like to know if this performance is pretty awful 
>> compared to some better way (e.g. different data structure, etc.
>> or if I have to code my own binary search, etc.)
>
> Yes, it's pretty bad.
>
>
> Compare:
>
> (defparameter *words*
>  (with-open-file (input "/usr/share/dict/words")
>    (loop :repeat 4000 :for word = (read-line input nil nil) :while word :collect word)))
> (defparameter *words-h* (make-hash-table :test (function equalp)))
> (loop :for word in *words* :do (setf (gethash word *words-h*) word))
>
> (time (loop :repeat 10000 :do (find "zebra" *words* :test (function string-equal))))
> (time (loop :repeat 10000 :do (gethash "zebra" *words-h*)))


Thank you very much.

Simply switching to a hash-table structure for that string list data
cut my timing by about 10x; for now it seems acceptable.

What would bring better results? For example using a vector and implementing
my own binary search on that? Or is there a shorter method to improve
that?



-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http://emba.bilgi.edu.tr        http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr
From: Wolfram Fenske
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <1165460504.919660.299570@j72g2000cwa.googlegroups.com>
Emre Sevinc <·····@bilgi.edu.tr> writes:

> Pascal Bourguignon <···@informatimago.com> writes:
>
>> Emre Sevinc <·····@bilgi.edu.tr> writes:
>>
>>> I have a list of strings sorted alphabetically and it contains
>>> about 1500 strings (string lengths vary between 2 - 15
>>> characters). The number of elements of this string list may grow
>>> up to 3000-4000 strings in the future:

[...]

>> (defparameter *words-h* (make-hash-table :test (function equalp)))
>> (loop :for word in *words* :do (setf (gethash word *words-h*) word))
>>
>> (time (loop :repeat 10000 :do (find "zebra" *words* :test (function string-equal))))
>> (time (loop :repeat 10000 :do (gethash "zebra" *words-h*)))

[...]

> Simply switching to a hash-table structure for that string list data
> cut my timing about x10, for now it seems acceptable.
>
> What would bring better results?

Probably nothing.  Hash tables have an asymptotic complexity of O(1),
i.e. the search time does not depend on the number of items you have
to search.  It doesn't get better than that.  E.g. binary search has
a complexity of O(log n) [1].  If the size of the input is small,
asymptotic complexity isn't a useful indicator for the speed of an
algorithm, because constant factors play a large role.  A binary search
or even a linear search might actually be faster than a hash table.
But several thousand elements isn't small anymore, so hash tables
really should be the fastest solution.
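The constant-factor point is easy to check empirically.  A minimal
sketch (COMPARE-LOOKUPS and the key format are made up for the
example; absolute timings depend entirely on the implementation):

```lisp
(defun compare-lookups (n &optional (queries 1000))
  "Time QUERIES worst-case linear searches over an N-element list
against the same number of hash-table lookups."
  (let ((list (loop :for i :below n :collect (format nil "w~D" i)))
        (hash (make-hash-table :test #'equalp)))
    (dolist (w list) (setf (gethash w hash) w))
    (let ((target (format nil "w~D" (1- n)))) ; last element: worst case
      (time (loop :repeat queries :do (find target list :test #'string-equal)))
      (time (loop :repeat queries :do (gethash target hash)))
      (values))))
```

Running (compare-lookups 10) and then (compare-lookups 4000) should
show the crossover: at ten elements the difference is in the noise, at
a few thousand the linear search loses badly.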


Footnotes:
[1]  If you're not familiar with Big O Notation, you should read up
     on it: <http://en.wikipedia.org/wiki/Big_O_notation>

--
Wolfram Fenske

A: Yes.
>Q: Are you sure?
>>A: Because it reverses the logical flow of conversation.
>>>Q: Why is top posting frowned upon?
From: Holger Schauer
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <yxzwt54f28b.fsf@gmx.de>
On 4845 September 1993, Wolfram Fenske wrote:
> Emre Sevinc <·····@bilgi.edu.tr> writes:
>> Simply switching to a hash-table structure for that string list data
>> cut my timing about x10, for now it seems acceptable.
>> What would bring better results?
> Probably nothing.  Hash tables have an asymptotic complexity of O(1),
> i. e. the search time does not depend on the number of items you have
> to search.

IIRC, there is a tradeoff with using hashes when handling strings,
which is the reason that typically tree-based approaches (red-black
trees, ternary search trees) are used when implementing dictionaries
(which is exactly what Emre is doing here). I've forgotten in which
cases exactly, but at least Wikipedia claims "Red-black trees, along
with AVL trees, offer the best possible worst-case guarantees for
insertion time, deletion time, and search time." (and I think I've
seen similar claims including proofs in various CS books).

I've been working on a ternary-search-tree for some time, using
CLOS. I couldn't beat the built-in hash tables at all, though. I guess
using CLOS wasn't the best choice when looking for speed, let alone
memory usage, but OTOH I haven't heavily tried to optimize it.

Holger

-- 
---          http://hillview.bugwriter.net/            ---
Technical terms of computer science - simply explained
89: PSD
       So that the fonts don't look as if they had been
       cut out with a fretsaw. (Meikel Katzengreis)
From: Wolfram Fenske
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <1165494087.480644.81310@l12g2000cwl.googlegroups.com>
Holger Schauer <··············@gmx.de> writes:

> On 4845 September 1993, Wolfram Fenske wrote:
>> Emre Sevinc <·····@bilgi.edu.tr> writes:
>>> Simply switching to a hash-table structure for that string list data
>>> cut my timing about x10, for now it seems acceptable.
>>> What would bring better results?
>> Probably nothing.  Hash tables have an asymptotic complexity of O(1),
>> i. e. the search time does not depend on the number of items you have
>> to search.
>
> IIRC, there is a tradeoff with using hashes when handling strings,

Which trade-off would that be?

> which is the reason that typically tree-based approaches (red-black
> trees, ternary search trees) are used when implementing dictionaries

This claim will be hard to verify.  Wait, I see the Wikipedia article
about Red-Black Trees says:

    A red-black tree is a type of self-balancing binary search tree, a
    data structure used in computer science, typically used to
    implement associative arrays.

    (<http://en.wikipedia.org/wiki/Red-black_tree>)

Well, I'm still not convinced.  Besides, it's not quite clear if this
sentence is saying that associative arrays are typically implemented
using self-balancing binary search trees or that a typical use of
self-balancing binary search trees is to implement associative arrays.
Either way, it's only a claim.

[...]

> I've forgotten in which cases exactly, but at least Wikipedia claims
> "Red-black trees, along with AVL trees, offer the best possible
> worst-case guarantees for insertion time, deletion time, and search
> time."

I've just re-read that section.  Maybe they're just talking about
Red-Black Trees in comparison to other binary tree implementations?
At least the preceding section doesn't mention any other data
structures besides binary trees.

They *might*--although this is pure conjecture on my part--also be
referring to the fact that a hash table may have to be resized upon
insertion because it has gotten too full.  Resizing means allocating
new memory and rehashing all of the keys--a pretty expensive
operation.  This circumstance makes insertion have linear worst-case
complexity.  But lookup is still O(1).
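If that insertion cost ever matters, it can be dodged up front:
MAKE-HASH-TABLE takes a :SIZE hint, so a table sized for the expected
number of keys can be filled without intermediate rehashes.  A sketch
(the key names are placeholders, and how implementations interpret
:SIZE and when they actually grow is implementation-dependent):

```lisp
;; Sized for ~4000 keys from the start, so the initial fill
;; should not trigger intermediate resize/rehash cycles.
(defparameter *presized* (make-hash-table :test #'equalp :size 4000))

(loop :for i :below 4000
      :for key = (format nil "entity~D" i)  ; placeholder keys
      :do (setf (gethash key *presized*) key))
```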

> (and I think I've seen similar claims including proofs in various CS
> books).

As far as arguments go, this is pretty weak. :-)

> I've been working on a ternary-search-tree for some time, using
> CLOS. I couldn't beat the built-in hash tables at all, though. I guess
> using CLOS wasn't the best choice when looking for speed, let alone
> memory usage, but OTOH I haven't heavily tried to optimize it.

If you're looking for speed, CLOS is usually not the answer [1].  Its
dynamic nature makes method calls rather slow compared to "normal"
function calls.
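A toy illustration of that overhead (the names are made up for the
example): the generic function goes through dynamic dispatch on every
call, while the structure accessor compiles down to a simple slot
read.  The relative cost varies a lot between implementations.

```lisp
(defclass clos-node ()
  ((key :initarg :key :reader clos-key)))   ; read via a generic function

(defstruct struct-node key)                  ; read via an ordinary function

(defun dispatch-demo (&optional (n 1000000))
  "Time N reads through CLOS dispatch vs. N plain accessor calls."
  (let ((a (make-instance 'clos-node :key "zebra"))
        (b (make-struct-node :key "zebra")))
    (time (loop :repeat n :do (clos-key a)))        ; generic dispatch
    (time (loop :repeat n :do (struct-node-key b))) ; plain call
    (values (clos-key a) (struct-node-key b))))
```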


Footnotes:
[1]  This posting might be of interest:


<http://groups.google.com/group/comp.lang.lisp/browse_frm/thread/84a630cbfe45f2a5/4a7849fb207dfc76?lnk=gst&q=CLOS+monte+carlo&rnum=1#4a7849fb207dfc76>

     In it Kenny argues that implementing a Monte Carlo simulation
     using CLOS isn't such a great idea.

--
Wolfram Fenske

A: Yes.
>Q: Are you sure?
>>A: Because it reverses the logical flow of conversation.
>>>Q: Why is top posting frowned upon?
From: Pascal Bourguignon
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <87r6vc5tij.fsf@thalassa.informatimago.com>
Emre Sevinc <·····@bilgi.edu.tr> writes:

> What would bring better results? For example using a vector and implementing
> my own binary search on that? Or is there a shorter method to improve
> that?

Well, it depends on the implementation.  The constants are important.

In clisp, anything you write is slower than using a primitive like
GETHASH.

The regexp ought to be the fastest, with at most (length
string-searched) character compare operations.  

With the hash table, we'd have (length string-searched) more complex
operations (at least one shift plus one xor each) to compute the hash
value, plus some more to index the hash table (including a modulo =
division = slow).  Then, if there's a collision in the hash table, we
fall back to the bucket's subsidiary complexity, e.g. O(log(n)), with
n = size of the bucket.
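The per-character work described above can be sketched like this
(purely illustrative -- one shift and one xor per character, then a
modulo to pick the bucket; this is not any implementation's actual
SXHASH):

```lisp
(defun toy-string-hash (string n-buckets)
  "Illustrative string hash: one shift and one xor per character,
kept in fixnum range, then reduced modulo N-BUCKETS."
  (let ((h 0))
    (loop :for c :across string
          :do (setf h (logand (logxor (ash h 5) (char-code c))
                              most-positive-fixnum)))
    (mod h n-buckets)))
```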

But this is entirely theoretical.  You should test it on real data.

Also, compiling a big regexp takes more time than building the
hash-table: you must take into account the usage pattern, and see if
you can amortize the set-up cost over the number of queries.
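One way to weigh that trade-off is to time the setup and the queries
separately and estimate the break-even point.  A rough sketch
(ELAPSED-MS and the word~D keys are made up for the example;
GET-INTERNAL-REAL-TIME is standard):

```lisp
(defun elapsed-ms (thunk)
  "Wall-clock milliseconds spent calling THUNK."
  (let ((start (get-internal-real-time)))
    (funcall thunk)
    (round (* 1000 (- (get-internal-real-time) start))
           internal-time-units-per-second)))

;; Setup cost, paid once.
(defparameter *table* (make-hash-table :test #'equalp))
(defparameter *setup-ms*
  (elapsed-ms (lambda ()
                (loop :for i :below 4000
                      :for key = (format nil "word~D" i)
                      :do (setf (gethash key *table*) key)))))

;; Query cost, paid per lookup; amortize *SETUP-MS* over these.
(defparameter *query-ms*
  (elapsed-ms (lambda ()
                (loop :repeat 10000 :do (gethash "word42" *table*)))))
```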

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

NEW GRAND UNIFIED THEORY DISCLAIMER: The manufacturer may
technically be entitled to claim that this product is
ten-dimensional. However, the consumer is reminded that this
confers no legal rights above and beyond those applicable to
three-dimensional objects, since the seven new dimensions are
"rolled up" into such a small "area" that they cannot be
detected.
From: Madhu
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <m3odqfq60v.fsf@robolove.meer.net>
* Emre Sevinc  <···············@emba-emre.bilgi.networks> :
|
| Simply switching to a hash-table structure for that string list data
| cut my timing about x10, for now it seems acceptable.
|
| What would bring better results? For example using a vector and implementing
| my own binary search on that? Or is there a shorter method to improve
| that?

Since this comes up often, I'm posting part of my trie.lisp file after
slapping an LLGPL on it (in the hope it will be useful).  Try the
following code after switching gethash etc. to gettrie; if it makes an
improvement, I can send over the full file.


;;; -*- Mode: LISP; Package: :cl-user; BASE: 10; Syntax: ANSI-Common-Lisp; -*-
;;;
;;;   Touched: Sun Nov 20 15:04:19 2005 +0530 <·······@net.meer>
;;;   Time-stamp: <06/12/07 17:05:53 madhu>
;;;   Bugs-To: ·······@net.meer
;;;
;;; (C) 2005 Madhu. All rights Reserved.
;;;
;;; You are hereby granted the rights to distribute and use this
;;; software as governed by the terms of the Lisp Lesser GNU Public
;;; License (http://opensource.franz.com/preamble.html), known as the
;;; LLGPL.
;;;
;;; This software is distributed in the hope that it will be useful,
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
;;;
;;; $Id: mytrie.lisp,v 1.3 2006/08/23 15:36:42 madhu Exp madhu $
;;;
(defpackage "TRIE"
  (:use "CL")
  (:export "GETTRIE" "SETTRIE" "MAPTRIE" "MAKE-TRIE" "REMTRIE"
	   "MATCHTRIE" "TRIE-COMPLETIONS" "TRIE-RANDOM-KEY"
	   :count-conses :compress-trie
	   ))
(in-package "TRIE")
(declaim (optimize (speed 3)))

;;; consing newbie version
;;; ----------------------------------------------------------------------
;;; TRIE: (VAL . ALIST)         |  CDR TRIE == (SUB-KEY . SUB-TRIE)+
;;; ALIST: (SUB-KEY . TRIE)+    |
;;; KEY: SUB-KEY+               |
;;;
(defun trie-descend (trie sub-key)
  "Return sub-trie associated with SUB-KEY."
  (let ((cons (assoc sub-key (cdr trie))))
    (when cons (cdr cons))))

(defun trie-get (trie key &optional default)
  "Internal. Second return value is non-NIL if the node for KEY
exists in TRIE."
  (if (null key)
      (values (car trie) T)
      (let ((sub-trie (trie-descend trie (car key))))
	(if sub-trie
	    (trie-get sub-trie (cdr key) default)
	    (values default nil)))))

(defun trie-set (trie key &optional (val T))
  "Internal."
  (if (null key)
      (setf (car trie) val)
      (let ((sub-trie (trie-descend trie (car key))))
	(unless sub-trie
	  (setq sub-trie (cons nil nil))
	  (push (cons (car key) sub-trie) (cdr trie)))
    	(trie-set sub-trie (cdr key) val))))

(defun trie-delete (trie key)
  "Internal."
  (if (null key)
      (prog1 (car trie) (setf (car trie) nil))
      (let ((sub-trie (trie-descend trie (car key))))
	(when sub-trie
	  (multiple-value-prog1
	      (trie-delete sub-trie (cdr key))
	    (unless (or (car sub-trie) (cdr sub-trie))
	      (setf (cdr trie)		; cmu bug in delete
		    (delete sub-trie (cdr trie)  :test #'eq :key #'cdr))))))))

;;;
;;;  PUBLIC FUNCTIONS (FOLLOWING THE HASH TABLE INTERFACE)
;;;

(defun make-trie ()
  (cons nil nil))

(defun gettrie (key trie &optional default)
  "Returns a single value."
  (nth-value 0
	     (trie-get trie
		       (etypecase key
			 (cons key)
			 (sequence (coerce key 'list)))
		       default)))

(defsetf gettrie (key trie) (value)
  (let ((key-var (gensym)))
    `(let ((,key-var ,key))
       (trie-set ,trie
		 (etypecase ,key-var
		   (cons ,key-var)
		   (sequence (coerce ,key-var 'list)))
		 ,value))))

(defun remtrie (key trie)
  (trie-delete trie
	       (etypecase key
		 (cons key)
		 (sequence (coerce key 'list)))))

(defun maptrie (map-function trie)
  (labels ((rec (trie &optional prefix key-so-far)
	     (if (null prefix)
		 (progn ;; nconc
		   (if (car trie)
			    (list (funcall map-function
					   (reverse key-so-far)
					   (car trie))))
			(if (cdr trie)
			    (loop for (sub-key . sub-trie) in (cdr trie)
				  ;; nconc
				  do (rec sub-trie nil
					     (cons sub-key key-so-far)))))
		 (let ((sub-trie (trie-descend trie (car prefix))))
		   (when sub-trie
		     (rec sub-trie (cdr prefix)
			  (push (car prefix) key-so-far)))))))
    (rec trie)
    (values)))



;;;
;;;
(defun compress-trie (trie &key (+hashmax+ 1031) (+hash0+ 1)
		      (hash-subkey #'char-code)
		      (hash-val #'(lambda (x) (if x 1 0))))
  "Compress TRIE into a new DAG by sharing common accepting
states. Returns the s-exp DAG."  ; uses Huet's hashing scheme
  (let ((table (make-array +hashmax+ :initial-element nil)))
    (labels ((recall (trie-node hashkey)
	       (values (loop for elem in (aref table hashkey)
			     if (equalp trie-node elem) return elem finally
			     (push trie-node (aref table hashkey))
			     (return trie-node))
		       hashkey))
	     (compress-trie-internal (trie)
	       (cond ((endp (cdr trie))
		      (recall trie (funcall hash-val (car trie))))
		     (t (let ((hash +hash0+) sub-tries1) ; linear hash code.
			  (loop for (letter . sub-trie) in (cdr trie) do
				(multiple-value-bind (sub-trie1 hash1)
				    (compress-trie-internal sub-trie)
				  (incf hash (* (funcall hash-subkey letter)
						hash1))
				  (push (cons letter sub-trie1) sub-tries1)))
			  (recall (cons (car trie) (nreverse sub-tries1))
				  (mod (+ hash (funcall hash-val (car trie)))
				       +hashmax+)))))))
      (compress-trie-internal trie))))


(defun count-conses (sexp  &optional (table (make-hash-table :test #'eq))
		     (count 0))
  "Return as values the number of cons cells and the number of
distinct cons cells in SEXP."
  (values (cond ((endp sexp) count)
		(t (incf (gethash sexp table 0))
		   (if (consp (car sexp))
		       (count-conses (car sexp) table
				     (+ count 1
					(count-conses (cdr sexp) table)))
		       (count-conses (cdr sexp) table (1+ count)))))
	  (hash-table-count table)))
From: Emre Sevinc
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <r0lslfrd92v.fsf@emba-emre.bilgi.networks>
Madhu <·······@meer.net> writes:

> * Emre Sevinc  <···············@emba-emre.bilgi.networks> :
> |
> | Simply switching to a hash-table structure for that string list data
> | cut my timing about x10, for now it seems acceptable.
> |
> | What would bring better results? For example using a vector and implementing
> | my own binary search on that? Or is there a shorter method to improve
> | that?
>
> Since this comes often I'm posting part of my trie.lisp file after
> slapping an LLGPL on it (in the hope it will be useful). Try the
> following code, after switching gethash etc to gettrie, if it makes an
> improvement, I can send over the full file.

I have tried your "trie" package for the same set of strings for 
1000 and 10000 searches:

(defparameter *alive-entities-trie* (make-trie))

(with-open-file (stream "strings.txt")
  (loop for line = (read-line stream nil)
      until (eq line nil)
      do (setf (gettrie line *alive-entities-trie*) line)))




(time (loop :repeat 1000 do (gettrie "zebra" *alive-entities-trie*)))

; cpu time (non-gc) 32 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  32 msec user, 0 msec system
; real time  32 msec
; space allocation:
;  21,108 cons cells, 1,376 other bytes, 0 static bytes


(time (loop :repeat 10000 do (gettrie "zebra" *alive-entities-trie*)))

; cpu time (non-gc) 205 msec user, 0 msec system
; cpu time (gc)     30 msec user, 0 msec system
; cpu time (total)  235 msec user, 0 msec system
; real time  235 msec
; space allocation:
;  210,109 cons cells, 1,504 other bytes, 0 static bytes


On the other hand, when I did similar searches with the hash table:


(TIME (LOOP :REPEAT 1000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
; cpu time (non-gc) 16 msec user, 0 msec system
; cpu time (gc)     16 msec user, 0 msec system
; cpu time (total)  32 msec user, 0 msec system
; real time  31 msec
; space allocation:
;  16,108 cons cells, 73,472 other bytes, 0 static bytes


(TIME (LOOP :REPEAT 10000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
; cpu time (non-gc) 172 msec user, 0 msec system
; cpu time (gc)     31 msec user, 0 msec system
; cpu time (total)  203 msec user, 0 msec system
; real time  203 msec
; space allocation:
;  160,109 cons cells, 721,560 other bytes, 0 static bytes


It seems like your data structure and algorithm are slightly slower for my
data at hand, but in terms of "other bytes" used (I don't know exactly what
that means) your package uses much less than the hash alternative (though
your package uses more cons cells compared to the hash version).

BTW, these results are from a 2.40 Ghz P-IV with 512 MB RAM,
running MS Windows 2000 Server and Franz Inc.'s Allegro Express Edition 8.0.


-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http://emba.bilgi.edu.tr        http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr
From: George Neuner
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <6qvgn21uumm7to34ivpe63tltikm9a0bc4@4ax.com>
On Thu, 07 Dec 2006 17:22:48 +0200, Emre Sevinc <·····@bilgi.edu.tr>
wrote:

>Madhu <·······@meer.net> writes:
>
>> * Emre Sevinc  <···············@emba-emre.bilgi.networks> :
>> |
>> | Simply switching to a hash-table structure for that string list data
>> | cut my timing about x10, for now it seems acceptable.
>> |
>> | What would bring better results? For example using a vector and implementing
>> | my own binary search on that? Or is there a shorter method to improve
>> | that?
>>
>> Since this comes often I'm posting part of my trie.lisp file after
>> slapping an LLGPL on it (in the hope it will be useful). Try the
>> following code, after switching gethash etc to gettrie, if it makes an
>> improvement, I can send over the full file.
>
>I have tried your "trie" package for the same set of strings for 
>1000 and 10000 searches:
>
>(defparameter *alive-entities-trie* (make-trie))
>
>(with-open-file (stream "strings.txt")
>  (loop for line = (read-line stream nil)
>      until (eq line nil)
>      do (setf (gettrie line *alive-entities-trie*) line)))
>
>
>
>
>(time (loop :repeat 1000 do (gettrie "zebra" *alive-entities-trie*)))
>
>; cpu time (non-gc) 32 msec user, 0 msec system
>; cpu time (gc)     0 msec user, 0 msec system
>; cpu time (total)  32 msec user, 0 msec system
>; real time  32 msec
>; space allocation:
>;  21,108 cons cells, 1,376 other bytes, 0 static bytes
>
>
>(time (loop :repeat 10000 do (gettrie "zebra" *alive-entities-trie*)))
>
>; cpu time (non-gc) 205 msec user, 0 msec system
>; cpu time (gc)     30 msec user, 0 msec system
>; cpu time (total)  235 msec user, 0 msec system
>; real time  235 msec
>; space allocation:
>;  210,109 cons cells, 1,504 other bytes, 0 static bytes
>
>
>On the other hand, when I did the similar searches with hash:
>
>
>(TIME (LOOP :REPEAT 1000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
>; cpu time (non-gc) 16 msec user, 0 msec system
>; cpu time (gc)     16 msec user, 0 msec system
>; cpu time (total)  32 msec user, 0 msec system
>; real time  31 msec
>; space allocation:
>;  16,108 cons cells, 73,472 other bytes, 0 static bytes
>
>
>(TIME (LOOP :REPEAT 10000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
>; cpu time (non-gc) 172 msec user, 0 msec system
>; cpu time (gc)     31 msec user, 0 msec system
>; cpu time (total)  203 msec user, 0 msec system
>; real time  203 msec
>; space allocation:
>;  160,109 cons cells, 721,560 other bytes, 0 static bytes
>
>
>It seems like your data structure and algorithm is slightly slower for my
>data at hand but in terms of "other bytes" used (I don't know exactly what
>that means) your package uses much less than the hash alternative (but your
>package uses more cons cells compared to hash version).
>
>BTW, these results are from a 2.40 Ghz P-IV with 512 MB RAM,
>running MS Windows 2000 Server and Franz Inc.'s Allegro Express Edition 8.0.

One other advantage of a trie is that words which are prefixes of
other words can also be found with no increase in the storage
requirements.
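With the cons-cell representation posted earlier (node = (VAL . ALIST)),
that falls out naturally: "car" and "carpet" share one path, and "car"
counts as a word simply because the VAL at its node is set.  A
self-contained sketch (the DEMO-TRIE-* names are made up here, not
part of Madhu's TRIE package):

```lisp
(defun demo-trie-add (trie word &optional (i 0))
  "Insert WORD into TRIE, a (VAL . ALIST) node."
  (if (= i (length word))
      (setf (car trie) t)              ; mark this node as accepting
      (let* ((c (char word i))
             (sub (cdr (assoc c (cdr trie)))))
        (unless sub
          (setf sub (cons nil nil))
          (push (cons c sub) (cdr trie)))
        (demo-trie-add sub word (1+ i)))))

(defun demo-trie-node (trie word &optional (i 0))
  "Node reached by walking WORD, or NIL if the path is absent."
  (if (= i (length word))
      trie
      (let ((sub (cdr (assoc (char word i) (cdr trie)))))
        (when sub (demo-trie-node sub word (1+ i))))))

(defun demo-trie-word-p (trie word)
  "True only if WORD itself was inserted, not merely a prefix."
  (let ((node (demo-trie-node trie word)))
    (and node (car node) t)))
```

After inserting "car" and "carpet", both are found as words, while
"carp" exists only as a path through the shared nodes.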

George
--
for email reply remove "/" from address
From: Madhu
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <m3r6vaohml.fsf@robolove.meer.net>
Helu
* Emre Sevinc <···············@emba-emre.bilgi.networks> :
| It seems like your data structure and algorithm is slightly slower for my
| data at hand but in terms of "other bytes" used (I don't know exactly what
| that means) your package uses much less than the hash alternative (but your
| package uses more cons cells compared to hash version).
|
| BTW, these results are from a 2.40 Ghz P-IV with 512 MB RAM,
| running MS Windows 2000 Server and Franz Inc.'s Allegro Express Edition 8.0.

Cool, thanks.  As mentioned, for your configuration/problem, hash
tables are probably sufficient.  However, if you still have the test
setup handy, perhaps you could compile the following function and
rerun it.  (The data structure is very simple -- it just uses cons cells.)

(defun gettrie (string trie &optional default)
  "Limited version. STRING is a character string. Returns a single value"
  (declare (type string string) (type list trie))
  (loop for c of-type character across string
	for sub-trie of-type list = (trie-descend trie c)
	then (trie-descend sub-trie c)
	if (endp sub-trie) return default
	finally (return (if sub-trie (car sub-trie) default))))

--
Madhu
From: Emre Sevinc
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <r0l7ix2sioy.fsf@emba-emre.bilgi.networks>
Madhu <·······@meer.net> writes:

> Helu
> * Emre Sevinc <···············@emba-emre.bilgi.networks> :
> | It seems like your data structure and algorithm is slightly slower for my
> | data at hand but in terms of "other bytes" used (I don't know exactly what
> | that means) your package uses much less than the hash alternative (but your
> | package uses more cons cells compared to hash version).
> |
> | BTW, these results are from a 2.40 Ghz P-IV with 512 MB RAM,
> | running MS Windows 2000 Server and Franz Inc.'s Allegro Express Edition 8.0.
>
> Cool, thanks. (as mentioned), for your configuration/problem, hash
> tables are probably sufficient.  However if you still have the test
> setup handy, perhaps you could compile the following function and
> rerun it. (The data structure is very simple -- it just uses cons cells)
>
> (defun gettrie (string trie &optional default)
>   "Limited version. STRING is a character string. Returns a single value"
>   (declare (type string string) (type list trie))
>   (loop for c of-type character across string
> 	for sub-trie of-type list = (trie-descend trie c)
> 	then (trie-descend sub-trie c)
> 	if (endp sub-trie) return default
> 	finally (return (if sub-trie (car sub-trie) default))))


I did the tests for the same set of data, same PC and Common Lisp compiler,
here are the results for 1000 and 10000 searches:

(time (loop :repeat 1000 do (gettrie "zebra" *alive-entities-trie*)))

; cpu time (non-gc) 15 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  15 msec user, 0 msec system
; real time  1,625 msec
; space allocation:
;  16,108 cons cells, 1,624 other bytes, 0 static bytes



(time (loop :repeat 10000 do (gettrie "zebra" *alive-entities-trie*)))

; cpu time (non-gc) 156 msec user, 0 msec system
; cpu time (gc)     16 msec user, 0 msec system
; cpu time (total)  172 msec user, 0 msec system
; real time  203 msec
; space allocation:
;  160,113 cons cells, 1,504 other bytes, 0 static 



Regards,

-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http://emba.bilgi.edu.tr        http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr
From: Madhu
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <m31wna7b8s.fsf@robolove.meer.net>
* Emre Sevinc <···············@emba-emre.bilgi.networks> :
| I did the tests for the same set of data, same PC and Common Lisp compiler,
| here are the results for 1000 and 10000 searches:

Thanks.  I think this is now competitive, and twice as fast (if you
look at the user time) on this simple (and, I'm sure, probably rather
misleading!) test:

|
| (time (loop :repeat 1000 do (gettrie "zebra" *alive-entities-trie*)))
|
| ; cpu time (non-gc) 15 msec user, 0 msec system
| ; cpu time (gc)     0 msec user, 0 msec system
| ; cpu time (total)  15 msec user, 0 msec system
| ; real time  1,625 msec
| ; space allocation:
| ;  16,108 cons cells, 1,624 other bytes, 0 static bytes

(TIME (LOOP :REPEAT 1000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
; cpu time (non-gc) 16 msec user, 0 msec system
; cpu time (gc)     16 msec user, 0 msec system
; cpu time (total)  32 msec user, 0 msec system
; real time  31 msec
; space allocation:
;  16,108 cons cells, 73,472 other bytes, 0 static bytes

| (time (loop :repeat 10000 do (gettrie "zebra" *alive-entities-trie*)))
|
| ; cpu time (non-gc) 156 msec user, 0 msec system
| ; cpu time (gc)     16 msec user, 0 msec system
| ; cpu time (total)  172 msec user, 0 msec system
| ; real time  203 msec
| ; space allocation:
| ;  160,113 cons cells, 1,504 other bytes, 0 static bytes
|
|

(TIME (LOOP :REPEAT 10000 DO (GETHASH "zebra" *ALIVE-ENTITIES*)))
; cpu time (non-gc) 172 msec user, 0 msec system
; cpu time (gc)     31 msec user, 0 msec system
; cpu time (total)  203 msec user, 0 msec system
; real time  203 msec
; space allocation:
;  160,109 cons cells, 721,560 other bytes, 0 static bytes
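(For reference, the hash-table version of *ALIVE-ENTITIES* used above was presumably built along these lines. This is a sketch with a shortened word list; the choice of #'EQUALP as the test, to match the case-insensitive STRING-EQUAL of the original FIND, is an assumption.)

```lisp
;; Sketch: building *ALIVE-ENTITIES* as a hash table from the word list.
;; #'EQUALP gives case-insensitive string keys, matching STRING-EQUAL;
;; use #'EQUAL instead for case-sensitive lookup.
(defparameter *alive-entities*
  (let ((table (make-hash-table :test #'equalp)))
    (dolist (name '("abanoz" "abi" "abla" "acaip" "adam" "ahtapot" "zebra"))
      (setf (gethash name table) t))
    table))

(gethash "zebra" *alive-entities*)  ; => T, T
```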


--
Madhu
From: John Thingstad
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <op.tj524znzpqzri1@pandora.upc.no>
On Thu, 07 Dec 2006 02:14:44 +0100, Emre Sevinc <·····@bilgi.edu.tr> wrote:

>
> I'd like to know if this performance is pretty awful
> compared to some better way (e.g. different data structure, etc.
> or if I have to code my own binary search, etc.)
>
> Regards,
>

Well, obviously a linear search is the slowest: O(n), with n/2
comparisons on average, where n is the number of elements.

A faster solution is a binary search with O(log2 n). It is trivial to  
write.
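Such a binary search might be sketched like this (a sketch, not tested against the original data set; it assumes the list has first been sorted with STRING-LESSP and coerced to a vector, and uses STRING-EQUAL/STRING-LESSP to stay case-insensitive like the original FIND):

```lisp
;; Sketch: binary search over a vector of strings sorted by STRING-LESSP.
;; Returns the matching element, or NIL if KEY is not present.
(defun string-binary-search (key vector)
  (let ((lo 0)
        (hi (length vector)))
    (loop :while (< lo hi)
          :do (let* ((mid (floor (+ lo hi) 2))   ; midpoint of [lo, hi)
                     (elt (aref vector mid)))
                (cond ((string-equal key elt) (return elt))
                      ((string-lessp key elt) (setf hi mid))
                      (t (setf lo (1+ mid))))))))

;; One-time setup from the original list:
;; (defparameter *alive-entities-vector*
;;   (coerce (sort (copy-list *alive-entities*) #'string-lessp) 'vector))
```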

The fastest way is to use a hash table, which is O(1). That's right: one
lookup and it is there.
A hash table does add a bit of overhead, though, and is slightly more
involved to implement.

Which is best for you I don't know.
It sounds to me that for one user a linear search gives acceptable
performance. If several hundred users are searching the array, of
course, things look different.

In general, get the thing working first. Make it simple.
Focus on the basic functionality and the work-flow of the application.
You can add all the nice details later.
If performance is slow, use a profiler to isolate where the program is
spending its time, and fix those spots.

Often things like I/O (disk, screen, etc.) are the real killers, so
fixing the search might have little or no effect. Test, don't guess.
And if it's fast enough, it's fast enough.
Don't optimize unless you have to. Clarity and ease of understanding
the code are usually more important anyhow. (Overzealous optimizers
tend to produce buggy code..)

With a basic working application, add new features one by one and test
as you go along.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Emre Sevinc
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <r0lwt54cuec.fsf@emba-emre.bilgi.networks>
"John Thingstad" <··············@chello.no> writes:

> On Thu, 07 Dec 2006 02:14:44 +0100, Emre Sevinc <·····@bilgi.edu.tr> wrote:
>
>>
>> I'd like to know if this performance is pretty awful
>> compared to some better way (e.g. different data structure, etc.
>> or if I have to code my own binary search, etc.)
>>
>> Regards,
>>
>
> Well obviously a linear search is the slowest O(n/2) average where n
> is  the number of elements.
>
> A faster solution is a binary search with O(log2 n). It is trivial to
> write.
>
> The fastest way is to use a hash table O(1). That's right, one lookup
> and  it is there.

Hmm, so my previous suggestion in reply to Pascal's answer was wrong:
a binary search (using string-equal) on a vector is not going to be
faster than the hash-table structure, since the latter retrieves the
required element directly from its index (hash value). Thanks for the
clarification.

> Now a hash-table provides a bit of overhead. Slightly more involved to
> implement.
>
> Which is best for you I don't know.
> Sounds to me that for one user linear search gives acceptable performance.
> If several hundred users are searching the array of course things look
> different.

No, this is a simple part of an NLP application that works in a
batch-processing, single-user mode.


> In general get the thing working first. Make it simple.
> Focus on the basic functionality. The work-flow of the application.
> You can add all the nice details later.
> If performance is slow use a profiler to isolate where the program is
> using time and fix these.
> Often things like I/O, disk, screen etc. are the real killers so
> fixing  the search might  have little or no effect. Test don't guess. 
> And if it's fast enough it's  fast enough.
> Don't optimize unless you have to. Clarity and ease of understanding
> the  code are usually more important anyhow. (Overzealous optimizers tend to 
> produce buggy code..)

I agree with you (and Knuth) on these optimization issues. For the
situation at hand I don't think converting to a hash table adds undue
complexity to the code, since it is already a very basic lookup
procedure.



-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http://emba.bilgi.edu.tr        http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr
From: Pascal Bourguignon
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <87mz605soy.fsf@thalassa.informatimago.com>
Emre Sevinc <·····@bilgi.edu.tr> writes:

> "John Thingstad" <··············@chello.no> writes:
>
>> On Thu, 07 Dec 2006 02:14:44 +0100, Emre Sevinc <·····@bilgi.edu.tr> wrote:
>>
>>>
>>> I'd like to know if this performance is pretty awful
>>> compared to some better way (e.g. different data structure, etc.
>>> or if I have to code my own binary search, etc.)
>>>
>>> Regards,
>>>
>>
>> Well obviously a linear search is the slowest O(n/2) average where n
>> is  the number of elements.
>>
>> A faster solution is a binary search with O(log2 n). It is trivial to
>> write.
>>
>> The fastest way is to use a hash table O(1). That's right, one lookup
>> and  it is there.
>
> Hmm, so my previous suggestion to Pascal's answer is wrong, implementing
> a binary search (using string-equal) on a vector is not going to be
> faster than that hash-table structure since it just retrieves the
> required element based on its index (hash value). Thanks for clarification.

Indeed, but a search that descends a tree character by character
(e.g., like a DFA) may be faster still; then again, a regexp library
might be better optimized yet.

;; Pseudo-code: walk a vector-indexed character tree (like a DFA).
;; *root* is assumed to be a vector indexed by char-code; each slot is
;; either NIL or a cons (got-word . next-table), where next-table is
;; another such vector (or NIL at a leaf).
(loop :for i :from 0 :below (length string)
      :for (got-word . current-table) 
            = (aref *root*        (char-code (aref string i)))
        :then (aref current-table (char-code (aref string i)))
      :until (null current-table)
      :finally (return (and (= i (length string)) got-word)))


The gethash needs to do something like:

(let ((buck (aref
               (hash-table-buckets hash-table)
               (mod (loop :for i :from 0 :below (length string) ; it might cheat 
                          ; and use less than length string characters, but it 
                          ; might get more  collisions.
                          :for n = 1 :then (mod (+ 5 i) 27)
                          :for h = i :then (logxor h 
                                              (ash (char-code (aref string i)) n))
                          :finally (return h))
                    (length (hash-table-buckets hash-table))))))
  (if (first buck)
      (if (funcall (hash-table-test hash-table) key (second buck))
          (values (third buck) t)
          (values nil nil))
      (binary-search key (second buck))))

Yes, it's still O(1) with respect to the number of entries in the hash
table, but the constants with respect to the length of the key are bigger.

On the other hand, it might use less memory than the vectors used
above, so it might still be faster if it doesn't have to refresh L1
cache lines.  Or not.  It depends on the implementation, the
processor, and the application's memory usage pattern.  Theoretically
we cannot say anything. YOU MUST TEST THE VARIOUS DATA STRUCTURES ON
REAL DATA!  (Of course, after having profiled the application to be
sure that this is the bottleneck.)


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Until real software engineering is developed, the next best practice
is to develop with a dynamic system that has extreme late binding in
all aspects. The first system to really do this in an important way
is Lisp. -- Alan Kay
From: John Thingstad
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <op.tj59rohxpqzri1@pandora.upc.no>
On Thu, 07 Dec 2006 03:46:21 +0100, Pascal Bourguignon  
<···@informatimago.com> wrote:

Gnarly code.. I feel a bug sneaking in ;)
(Obviously, you are better at this than me. But honestly that seems  
excessively complex.)

>
> On the other hand, it might use less memory than the vectors used
> above, so it might still be faster if it hasn't to refresh L1 cache
> lines.  Or not.  It depends on the implementation, and the processor,
> and the application memory usage pattern.  Theorically we cannot say
> anything. YOU MUST TEST THE VARIOUS DATA STRUCTURES ON REAL DATA!  (of
> course, after having profiled the application to be sure that this is
> the bottleneck).
>
>

Good point. Should have mentioned that.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Pascal Bourguignon
Subject: Re: How optimized is this string searching method?
Date: 
Message-ID: <87irgo5k60.fsf@thalassa.informatimago.com>
"John Thingstad" <··············@chello.no> writes:

> On Thu, 07 Dec 2006 03:46:21 +0100, Pascal Bourguignon
> <···@informatimago.com> wrote:
>
> Gnarly code.. I feel a bug sneaking in ;)

PSEUDO-code ;-)

> (Obviously, you are better at this than me. But honestly that seems
> excessively complex.)

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

PUBLIC NOTICE AS REQUIRED BY LAW: Any use of this product, in any
manner whatsoever, will increase the amount of disorder in the
universe. Although no liability is implied herein, the consumer is
warned that this process will ultimately lead to the heat death of
the universe.