From: dpapathanasiou
Subject: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175202616.126251.134340@o5g2000hsb.googlegroups.com>
After reading Bentley and Sedgewick's paper on using Ternary Search
Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
I wrote an implementation in CL, using a simple char array as the trie
structure.

It works, and searching is fast, but the resulting tries (arrays) are
massive, especially when saved to disk.

Then I saw this post about  packing multiple datapoints into a single
object -- fixnum, in his case, since he's dealing with integers
(http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/
f5b364a601c86987), and I was wondering if there's any similar low-
level, efficient structure that can be used to replace the char array
in my code?

Here's my code:

(defpackage :trie
  (:use
   :common-lisp)
  (:export
   :make-trie
   :insert-trie
   :search-trie)
  (:documentation
   "Library of ternary trees, inspired by Bentley & Sedgewick
article."))

(in-package :trie)

(defun make-trie ()
  "Return a trie object, an array, defined as:
  0 = split-char
  1 = lo-kid
  2 = eq-kid
  3 = hi-kid"
  (make-array '(4) :element-type 'char :initial-contents '(nil nil nil
nil)))

(defun get-trie-val (trie index)
  "Access the trie value defined by index:
  0 = split-char
  1 = lo-kid
  2 = eq-kid
  3 = hi-kid"
  (aref trie index))

(defun get-split-char (trie)
  (get-trie-val trie 0))

(defun get-lo-kid (trie)
  (get-trie-val trie 1))

(defun get-eq-kid (trie)
  (get-trie-val trie 2))

(defun get-hi-kid (trie)
  (get-trie-val trie 3))

(defun set-trie-val (trie index val)
  "Set the trie value defined by index:
  0 = split-char
  1 = lo-kid
  2 = eq-kid
  3 = hi-kid"
  (setf (aref trie index) val))

(defun set-split-char (trie val)
  (set-trie-val trie 0 val))

(defun set-lo-kid (trie val)
  (set-trie-val trie 1 val))

(defun set-eq-kid (trie val)
  (set-trie-val trie 2 val))

(defun set-hi-kid (trie val)
  (set-trie-val trie 3 val))

(defun trie-empty (trie)
  "A trie is empty if its split-char value is undefined."
  (null (get-split-char trie)))

(defun insert-trie (trie str)
  "Add the string to the trie."
  (when (> (length str) 0)
    (let ((c (char-downcase (char str 0))))
      (when (trie-empty trie)
        (setf trie (make-trie))
        (set-split-char trie c)
        (set-lo-kid trie (make-trie))
        (set-eq-kid trie (make-trie))
        (set-hi-kid trie (make-trie)))
      (cond ((char< c (get-split-char trie))
             (set-lo-kid trie (insert-trie (get-lo-kid trie) str)))
            ((char= c (get-split-char trie))
             (set-eq-kid trie (insert-trie (get-eq-kid trie) (subseq
str 1))))
            (t
             (set-hi-kid trie (insert-trie (get-hi-kid trie) str))))))
  trie)

(defun search-trie (trie str)
  "Return t if the string exists in the trie, nil if it does not."
  (if (> (length str) 0)
    (let ((c (char-downcase (char str 0))))
      (cond ((null (get-split-char trie))
             nil) ; run off end w/o match
            ((char< c (get-split-char trie))
             (search-trie (get-lo-kid trie) str))
            ((char> c (get-split-char trie))
             (search-trie (get-hi-kid trie) str))
            (t
             (search-trie (get-eq-kid trie) (subseq str 1)))))
    t))

From: Rainer Joswig
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <joswig-1E03F4.23345629032007@news-europe.giganews.com>
In article <························@o5g2000hsb.googlegroups.com>,
 "dpapathanasiou" <···················@gmail.com> wrote:

> After reading Bentley and Sedgewick's paper on using Ternary Search
> Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
> I wrote an implementation in CL, using a simple char array as the trie
> structure.
> 
> It works, and searching is fast, but the resulting tries (arrays) are
> massive, especially when saved to disk.
> 
> Then I saw this post about  packing multiple datapoints into a single
> object -- fixnum, in his case, since he's dealing with integers
> (http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/
> f5b364a601c86987), and I was wondering if there's any similar low-
> level, efficient structure that can be used to replace the char array
> in my code?
> 
> Here's my code:
> 
> (defpackage :trie
>   (:use
>    :common-lisp)
>   (:export
>    :make-trie
>    :insert-trie
>    :search-trie)
>   (:documentation
>    "Library of ternary trees, inspired by Bentley & Sedgewick
> article."))
> 
> (in-package :trie)
> 
> (defun make-trie ()
>   "Return a trie object, an array, defined as:
>   0 = split-char
>   1 = lo-kid
>   2 = eq-kid
>   3 = hi-kid"
>   (make-array '(4) :element-type 'char :initial-contents '(nil nil nil
> nil)))

That's not right. You have an array of characters.
NIL is not a character. You can say element type is
character and then initialize the elements to NIL
(which is not a character).

Plus:

Element 0 is a character. What are elements 1, 2 and 3?
They are others TRIEs. Not characters. You
can use a general ARRAY, but not a character ARRAY.

Use a structure instead of an array. See below.


> 
> (defun get-trie-val (trie index)
>   "Access the trie value defined by index:
>   0 = split-char
>   1 = lo-kid
>   2 = eq-kid
>   3 = hi-kid"
>   (aref trie index))
> 
> (defun get-split-char (trie)
>   (get-trie-val trie 0))
> 
> (defun get-lo-kid (trie)
>   (get-trie-val trie 1))
> 
> (defun get-eq-kid (trie)
>   (get-trie-val trie 2))
> 
> (defun get-hi-kid (trie)
>   (get-trie-val trie 3))
> 
> (defun set-trie-val (trie index val)
>   "Set the trie value defined by index:
>   0 = split-char
>   1 = lo-kid
>   2 = eq-kid
>   3 = hi-kid"
>   (setf (aref trie index) val))

SETF is better used to update data. You would
also want to write your interfaces to provide SETF
functions. You can do it with (defun (setf ... )...).
But usully you get it with DEFSTRUCT for free. See below.

> 
> (defun set-split-char (trie val)
>   (set-trie-val trie 0 val))
> 
> (defun set-lo-kid (trie val)
>   (set-trie-val trie 1 val))
> 
> (defun set-eq-kid (trie val)
>   (set-trie-val trie 2 val))
> 
> (defun set-hi-kid (trie val)
>   (set-trie-val trie 3 val))

A remark: all this boilerplate code could be replaced
with a single DEFSCTRUCT

(defstruct trie ...)

Then you'd get a structure object when you call MAKE-TREE.
Note that you can also use DEFSTRUCT when you want
an array as the underlying datastructure. There is a :TYPE
option to DEFSTRUCT where you can specify that.

You could also use DEFCLASS if you'd like the flexibility of CLOS
classes.


> 
> (defun trie-empty (trie)
>   "A trie is empty if its split-char value is undefined."
>   (null (get-split-char trie)))
> 
> (defun insert-trie (trie str)
>   "Add the string to the trie."
>   (when (> (length str) 0)
>     (let ((c (char-downcase (char str 0))))
>       (when (trie-empty trie)
>         (setf trie (make-trie))
>         (set-split-char trie c)
>         (set-lo-kid trie (make-trie))
>         (set-eq-kid trie (make-trie))
>         (set-hi-kid trie (make-trie)))
>       (cond ((char< c (get-split-char trie))
>              (set-lo-kid trie (insert-trie (get-lo-kid trie) str)))
>             ((char= c (get-split-char trie))
>              (set-eq-kid trie (insert-trie (get-eq-kid trie) (subseq
> str 1))))
>             (t
>              (set-hi-kid trie (insert-trie (get-hi-kid trie) str))))))
>   trie)
> 
> (defun search-trie (trie str)
>   "Return t if the string exists in the trie, nil if it does not."
>   (if (> (length str) 0)
>     (let ((c (char-downcase (char str 0))))
>       (cond ((null (get-split-char trie))
>              nil) ; run off end w/o match
>             ((char< c (get-split-char trie))
>              (search-trie (get-lo-kid trie) str))
>             ((char> c (get-split-char trie))
>              (search-trie (get-hi-kid trie) str))
>             (t
>              (search-trie (get-eq-kid trie) (subseq str 1)))))
>     t))

-- 
http://lispm.dyndns.org
From: Rainer Joswig
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <joswig-AAB747.23375329032007@news-europe.giganews.com>
In article <····························@news-europe.giganews.com>,
 Rainer Joswig <······@lisp.de> wrote:

> In article <························@o5g2000hsb.googlegroups.com>,
>  "dpapathanasiou" <···················@gmail.com> wrote:
> 
> > After reading Bentley and Sedgewick's paper on using Ternary Search
> > Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
> > I wrote an implementation in CL, using a simple char array as the trie
> > structure.
> > 
> > It works, and searching is fast, but the resulting tries (arrays) are
> > massive, especially when saved to disk.
> > 
> > Then I saw this post about  packing multiple datapoints into a single
> > object -- fixnum, in his case, since he's dealing with integers
> > (http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/
> > f5b364a601c86987), and I was wondering if there's any similar low-
> > level, efficient structure that can be used to replace the char array
> > in my code?
> > 
> > Here's my code:
> > 
> > (defpackage :trie
> >   (:use
> >    :common-lisp)
> >   (:export
> >    :make-trie
> >    :insert-trie
> >    :search-trie)
> >   (:documentation
> >    "Library of ternary trees, inspired by Bentley & Sedgewick
> > article."))
> > 
> > (in-package :trie)
> > 
> > (defun make-trie ()
> >   "Return a trie object, an array, defined as:
> >   0 = split-char
> >   1 = lo-kid
> >   2 = eq-kid
> >   3 = hi-kid"
> >   (make-array '(4) :element-type 'char :initial-contents '(nil nil nil
> > nil)))
> 
> That's not right. You have an array of characters.
> NIL is not a character. You can say element type is
> character and then initialize the elements to NIL
> (which is not a character).

I mean "you can't say.."


> 
> Plus:
> 
> Element 0 is a character. What are elements 1, 2 and 3?
> They are others TRIEs. Not characters. You
> can use a general ARRAY, but not a character ARRAY.
> 
> Use a structure instead of an array. See below.
> 
> 
> > 
> > (defun get-trie-val (trie index)
> >   "Access the trie value defined by index:
> >   0 = split-char
> >   1 = lo-kid
> >   2 = eq-kid
> >   3 = hi-kid"
> >   (aref trie index))
> > 
> > (defun get-split-char (trie)
> >   (get-trie-val trie 0))
> > 
> > (defun get-lo-kid (trie)
> >   (get-trie-val trie 1))
> > 
> > (defun get-eq-kid (trie)
> >   (get-trie-val trie 2))
> > 
> > (defun get-hi-kid (trie)
> >   (get-trie-val trie 3))
> > 
> > (defun set-trie-val (trie index val)
> >   "Set the trie value defined by index:
> >   0 = split-char
> >   1 = lo-kid
> >   2 = eq-kid
> >   3 = hi-kid"
> >   (setf (aref trie index) val))
> 
> SETF is better used to update data. You would
> also want to write your interfaces to provide SETF
> functions. You can do it with (defun (setf ... )...).
> But usully you get it with DEFSTRUCT for free. See below.
> 
> > 
> > (defun set-split-char (trie val)
> >   (set-trie-val trie 0 val))
> > 
> > (defun set-lo-kid (trie val)
> >   (set-trie-val trie 1 val))
> > 
> > (defun set-eq-kid (trie val)
> >   (set-trie-val trie 2 val))
> > 
> > (defun set-hi-kid (trie val)
> >   (set-trie-val trie 3 val))
> 
> A remark: all this boilerplate code could be replaced
> with a single DEFSCTRUCT
> 
> (defstruct trie ...)
> 
> Then you'd get a structure object when you call MAKE-TREE.
> Note that you can also use DEFSTRUCT when you want
> an array as the underlying datastructure. There is a :TYPE
> option to DEFSTRUCT where you can specify that.
> 
> You could also use DEFCLASS if you'd like the flexibility of CLOS
> classes.
> 
> 
> > 
> > (defun trie-empty (trie)
> >   "A trie is empty if its split-char value is undefined."
> >   (null (get-split-char trie)))
> > 
> > (defun insert-trie (trie str)
> >   "Add the string to the trie."
> >   (when (> (length str) 0)
> >     (let ((c (char-downcase (char str 0))))
> >       (when (trie-empty trie)
> >         (setf trie (make-trie))
> >         (set-split-char trie c)
> >         (set-lo-kid trie (make-trie))
> >         (set-eq-kid trie (make-trie))
> >         (set-hi-kid trie (make-trie)))
> >       (cond ((char< c (get-split-char trie))
> >              (set-lo-kid trie (insert-trie (get-lo-kid trie) str)))
> >             ((char= c (get-split-char trie))
> >              (set-eq-kid trie (insert-trie (get-eq-kid trie) (subseq
> > str 1))))
> >             (t
> >              (set-hi-kid trie (insert-trie (get-hi-kid trie) str))))))
> >   trie)
> > 
> > (defun search-trie (trie str)
> >   "Return t if the string exists in the trie, nil if it does not."
> >   (if (> (length str) 0)
> >     (let ((c (char-downcase (char str 0))))
> >       (cond ((null (get-split-char trie))
> >              nil) ; run off end w/o match
> >             ((char< c (get-split-char trie))
> >              (search-trie (get-lo-kid trie) str))
> >             ((char> c (get-split-char trie))
> >              (search-trie (get-hi-kid trie) str))
> >             (t
> >              (search-trie (get-eq-kid trie) (subseq str 1)))))
> >     t))

-- 
http://lispm.dyndns.org
From: dpapathanasiou
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175259374.710451.166780@l77g2000hsb.googlegroups.com>
Point taken about the array definition; I should *not* use :element-
type 'char since only :split-char is a char and all the :*-kid
components are themselves trie objects, not chars.

But even if I switch from an array to a defstruct, I still wind up
with a very bulky S-expression.

Using this (which, as you say is better, since it offers built-in
make, access and setting functions):

(defstruct trie
  split-char
  lo-kid
  eq-kid
  hi-kid)

If I create a trie to hold a two-letter string, e.g. "as", I get this
S-expression:

#S(TRIE
     :SPLIT-CHAR #\a
     :LO-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
NIL)
     :EQ-KID #S(TRIE
                  :SPLIT-CHAR #\s
                  :LO-KID #S(TRIE
                               :SPLIT-CHAR NIL
                               :LO-KID NIL
                               :EQ-KID NIL
                               :HI-KID NIL)
                  :EQ-KID #S(TRIE
                               :SPLIT-CHAR NIL
                               :LO-KID NIL
                               :EQ-KID NIL
                               :HI-KID NIL)
                  :HI-KID #S(TRIE
                               :SPLIT-CHAR NIL
                               :LO-KID NIL
                               :EQ-KID NIL
                               :HI-KID NIL))
     :HI-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
NIL))

Serialized to disk, that's 866 bytes.

Versus just 2 bytes (4 counting the double-quotes to denote it as a
string object) if I write "as" just like that.

So what I was getting at was: is there any way of packing the trie
data into a low-level, compact, efficient data structure, similar to
what the other post did for multiple integers into fixnums?

I'd like to be able to use tries for searching, but if I'm stuck
serializing such large S-expressions, I'd be better off saving them as
raw strings and using a fast library like cl-ppcre instead.
From: ············@gmail.com
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175261605.133156.125240@n59g2000hsh.googlegroups.com>
On Mar 30, 2:56 pm, "dpapathanasiou" <···················@gmail.com>
wrote:
> Point taken about the array definition; I should *not* use :element-
> type 'char since only :split-char is a char and all the :*-kid
> components are themselves trie objects, not chars.
>
> But even if I switch from an array to a defstruct, I still wind up
> with a very bulky S-expression.
>
> Using this (which, as you say is better, since it offers built-in
> make, access and setting functions):
>
> (defstruct trie
>   split-char
>   lo-kid
>   eq-kid
>   hi-kid)
>
> If I create a trie to hold a two-letter string, e.g. "as", I get this
> S-expression:
>
> #S(TRIE
>      :SPLIT-CHAR #\a
>      :LO-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL)
>      :EQ-KID #S(TRIE
>                   :SPLIT-CHAR #\s
>                   :LO-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :EQ-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :HI-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL))
>      :HI-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL))
>
> Serialized to disk, that's 866 bytes.
>
> Versus just 2 bytes (4 counting the double-quotes to denote it as a
> string object) if I write "as" just like that.
>
> So what I was getting at was: is there any way of packing the trie
> data into a low-level, compact, efficient data structure, similar to
> what the other post did for multiple integers into fixnums?
>
> I'd like to be able to use tries for searching, but if I'm stuck
> serializing such large S-expressions, I'd be better off saving them as
> raw strings and using a fast library like cl-ppcre instead.

How about storing it as #S(as) and make the reader turn that into a
trie structure?

Also, are the NIL-only parts really necessary? Could your example be
reduced to the following?

#S(TRIE
     :SPLIT-CHAR #\a
     :LO-KID NIL
     :EQ-KID #S(TRIE
                  :SPLIT-CHAR #\s
                  :LO-KID NIL
                  :EQ-KID NIL
                  :HI-KID NIL)
     :HI-KID NIL)
From: dpapathanasiou
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175262012.718712.295140@y80g2000hsf.googlegroups.com>
> How about storing it as #S(as) and make the reader turn that into a
> trie structure?

Yes, I thought of that, but counting the time to generate the trie
plus execute the search, I wonder if it wouldn't just be faster to run
a pattern scan on the raw string with cl-ppcre (cl-ppcre is incredibly
fast).

> Also, are the NIL-only parts really necessary? Could your example be
> reduced to the following?
>
> #S(TRIE
>      :SPLIT-CHAR #\a
>      :LO-KID NIL
>      :EQ-KID #S(TRIE
>                   :SPLIT-CHAR #\s
>                   :LO-KID NIL
>                   :EQ-KID NIL
>                   :HI-KID NIL)
>      :HI-KID NIL)

It's a good point, but even that S-expression is >> than the amount of
space need to contain the raw string.
From: Thomas F. Burdick
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175266239.922028.137280@l77g2000hsb.googlegroups.com>
On Mar 30, 2:56 pm, "dpapathanasiou" <···················@gmail.com>
wrote:
> Point taken about the array definition; I should *not* use :element-
> type 'char since only :split-char is a char and all the :*-kid
> components are themselves trie objects, not chars.
>
> But even if I switch from an array to a defstruct, I still wind up
> with a very bulky S-expression.
>
> Using this (which, as you say is better, since it offers built-in
> make, access and setting functions):
>
> (defstruct trie
>   split-char
>   lo-kid
>   eq-kid
>   hi-kid)
>
> If I create a trie to hold a two-letter string, e.g. "as", I get this
> S-expression:
>
> #S(TRIE
>      :SPLIT-CHAR #\a
>      :LO-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL)
>      :EQ-KID #S(TRIE
>                   :SPLIT-CHAR #\s
>                   :LO-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :EQ-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :HI-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL))
>      :HI-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL))
>
> Serialized to disk, that's 866 bytes.
>
> Versus just 2 bytes (4 counting the double-quotes to denote it as a
> string object) if I write "as" just like that.
>
> So what I was getting at was: is there any way of packing the trie
> data into a low-level, compact, efficient data structure, similar to
> what the other post did for multiple integers into fixnums?
>
> I'd like to be able to use tries for searching, but if I'm stuck
> serializing such large S-expressions, I'd be better off saving them as
> raw strings and using a fast library like cl-ppcre instead.

Sure, use an array of (unsigned-byte 32) for your trie.  Each 32-bit
entry is a node, (byte 8 0) is the char-code of your split-char, (byte
8 8) is lo-kid, (byte 8 16) eq-kid, (byte 8 24) hi-kid.  The kid
values are indexes into trie-array.  Since you don't have cycles, re-
use position 0 to mean NIL.  You can serialize by writing an 8-bit
byte to give the length of the trie, then the series of 32-bit words
which are the contents of the array.

Now, obviously that limits you to 8-bit characters and 256-node tries,
but it should be obvious how to extend it.
From: Alex Mizrahi
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <460fb382$0$90265$14726298@news.sunsite.dk>
(message (Hello 'dpapathanasiou)
(you :wrote  :on '(30 Mar 2007 05:56:14 -0700))
(

 d> So what I was getting at was: is there any way of packing the trie
 d> data into a low-level, compact, efficient data structure, similar to
 d> what the other post did for multiple integers into fixnums?

 d> I'd like to be able to use tries for searching, but if I'm stuck
 d> serializing such large S-expressions, I'd be better off saving them as
 d> raw strings and using a fast library like cl-ppcre instead.

i still have no clue what would you like to do, and i suspect you don't know 
it for sure too, since a person who perfectly understands the process should 
have no problems implementing it.

S-expressions are on-disk, structures are in-memory.
but you want to save raw strings on-disk, and ..cl-ppcre in memory??

how does cl-ppcre relate to s-expressions on disk??

why can't you have raw strings in file, then read them into structures, and 
use fast search algorithms?

actually what's optimal depends on your usage patterns. i some cases, when 
you have large string sets, it's not feasible to read whole thing into 
memory -- you can mmap a file, and operate on it.

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"?? ???? ??????? ?????") 
From: dpapathanasiou
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175444974.651989.8380@y66g2000hsf.googlegroups.com>
> i still have no clue what would you like to do, and i suspect you don't know
> it for sure too, since a person who perfectly understands the process should
> have no problems implementing it.

The post by Thomas F. Burdick answered my question.
From: Holger Schauer
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <yxzzm5qssc1.fsf@gmx.de>
On 4958 September 1993, dpapathanasiou wrote:
> Point taken about the array definition; I should *not* use :element-
> type 'char since only :split-char is a char and all the :*-kid
> components are themselves trie objects, not chars.

I have a toy implementation of ternary trees in CLOS, just in case you're
interested. Actually, I have two, another one for the general case
i.e., in which it's not assumed that you're comparing characters. 

I stopped playing around with it when I ran into the exact same
problem of storing the data to disc and when I discovered that I
couldn't beat hash tables at all (on a 200MB word list) -- I checked
on at least three CL implementations. And it's unbelievable slower
than the C code of B&S. IIRC, this is mainly due to their use of
unions to implement the data structure, but it might be that I mix
that up with the amount of memory used (hmm, I think I had put in some
optimization settings that resulted in at least one CL implementation
in slots not being created per default in fresh objects, which
minimized the memory footprint, but in turn meant I had to check
whether a slot really existed, slowing down the code considerably).

> But even if I switch from an array to a defstruct, I still wind up
> with a very bulky S-expression.

Yup.

> I'd like to be able to use tries for searching, but if I'm stuck
> serializing such large S-expressions, I'd be better off saving them as
> raw strings and using a fast library like cl-ppcre instead.

Exactly.

Holger

-- 
---          http://hillview.bugwriter.net/            ---
"You see, when you upload, the computer has to push the bits upward,
 and let me tell you, when you're talking millions of bits, it gets
 heavy. When you download, the bits just fall by gravity, so it's much
 faster (we provided you with some remarkable cushioning functions that
 prevent damage to your data). Non-ADSL? They use helium. That's why
 it's more expensive." -- seen on WTF
From: dpapathanasiou
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175607307.847271.152490@l77g2000hsb.googlegroups.com>
> I have a toy implementation of ternary trees in CLOS, just in case you're
> interested. Actually, I have two, another one for the general case
> i.e., in which it's not assumed that you're comparing characters.

Actually, I would be interested in seeing that; I did a "Most Obvious
Way" (tm) implementation after reading the article, but I didn't take
it any further, because of the reasons explained above.
From: Szymon
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <f03i9n$ktr$1@atlantis.news.tpi.pl>
dpapathanasiou wrote:

 > [...]
> #S(TRIE
>      :SPLIT-CHAR #\a
>      :LO-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL)
>      :EQ-KID #S(TRIE
>                   :SPLIT-CHAR #\s
>                   :LO-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :EQ-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL)
>                   :HI-KID #S(TRIE
>                                :SPLIT-CHAR NIL
>                                :LO-KID NIL
>                                :EQ-KID NIL
>                                :HI-KID NIL))
>      :HI-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
> NIL))
> 
> Serialized to disk, that's 866 bytes.
> 
> Versus just 2 bytes (4 counting the double-quotes to denote it as a
> string object) if I write "as" just like that.
 > [...]

Why not ordinary tree (not byte-level but shorter than yours):

((*) (A (* (S (T)))))

Regards, Szymon.
From: Tim Bradshaw
Subject: Re: Low-level, Efficient Ternary Search Trees for Storing Strings?
Date: 
Message-ID: <1175614280.962027.46390@n76g2000hsh.googlegroups.com>
On Mar 29, 10:10 pm, "dpapathanasiou" <···················@gmail.com>
wrote:
> After reading Bentley and Sedgewick's paper on using Ternary Search
> Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
> I wrote an implementation in CL, using a simple char array as the trie
> structure.
>
> It works, and searching is fast, but the resulting tries (arrays) are
> massive, especially when saved to disk.
>

A couple of related points about this:
1. The performance of these sorts of structures is heavily influenced
by locality issues.  I haven't thought about the structure you've
described but I suspect it may not be very good in this regard.  In
particular superficially awful approaches (linear search) do really
rather well because the memory system loves them.  Databases learnt
this lesson long ago since it's mattered for disks for longer than
it's mattered for memory.
2. Why store the structure on disk at all?  Is it faster or slower to
store the strings and "intern" them as you read them from the disk.

--tim