reloading a boatload of symbols more quickly

From: David J Cooper Jr
Subject: reloading a boatload of symbols more quickly
Date: Tue, 18 Mar 2003 23:23:51 +0000
Message-ID: <m3d6kouum0.fsf@dykstra.genworks.com>

Hi,

 I am looking for a fast way to reload a whole boatload of symbols (in
 the range of 500,000 of them) into a hash table. These symbols are
 built up as tokens from scanning emails for a Paul Graham "Plan for
 Spam"-style email classifier I am doing, called Squeakymail, for
 which I expect to release the source Real Soon Now.

 While the filter is running, the tables of good and bad tokens get
 built up incrementally, which is not a performance problem.

 But periodically I have to save each (user-specific) token hash table
 to disk, so they can be reloaded, and the hash tables restored, when
 the filter process is rebooted. What appears to take most of the time
 in this reloading is not really the reading, and not the populating
 of the hash table itself, but re-interning the symbols into the token
 package (a special package I have set up to contain all the symbols
 being used as classification tokens).

 I write the hash tables out as simply a file of pairs containing
 symbols and integers, e.g:

   ...
   (SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1) 
   (SQUEAKYMAIL-TOKENS::|Received*617166;| . 1) 
   (SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1) 
   (SQUEAKYMAIL-TOKENS::|kicks| . 1) 
   (SQUEAKYMAIL-TOKENS::|bills| . 26) 
   (SQUEAKYMAIL-TOKENS::|Bills| . 8) 
   ...

 And rebuild the hash tables by reading pairs from this file and
 setf'ing the corresponding hash table entries until hitting
 end-of-file.

 From simply reading this huge file containing pairs of symbols and
 integers, The Allegro CL profiler (prof:show-flat-profile) reveals
 the following:

    %     %     self  total            self   total  Function
   Time  Cum.   secs   secs    calls ms/call ms/call   name
   36.2  36.2  178.1  178.2                          EXCL::DO-REHASH
   33.1  69.3  162.5  162.5                          EXCL::GETHASH-STRING
   ...

 which says that most of the time goes into hashing based on symbol
 names and rehashing the package's symbol-table. (This is from simply
 reading all the pairs from the file and having the reader do its
 interning, _not_ actually building up my own hash tables --- building
 up my own hash tables once the symbols are read adds only slightly to
 the overall time).

 Re-reading the same huge file in the same running CL session runs in
 about 1/20 the time, and the profiler reveals almost no calls to
 EXCL::DO-REHASH or EXCL::GETHASH-STRING, presumably because all these
 symbols are already interned in the :squeakymail-tokens package.

 So my basic question is, is there a way to make all this reading of
 symbols go more quickly, by preparing the package ahead of time, or
 helping the reader along somehow? I am already defining the package
 with a size, as in 

   (defpackage :squeakymail-tokens (:use) (:size 500000))

 I suppose a higher-level question is, is this even an appropriate use
 of symbols and packages? According to Paul Graham's examples and the
 Henley example from ANSI_Common_Lisp, it would appear that it is
 appropriate.

Thanks,

 -dave
 David J Cooper Jr, Genworks International

Re: reloading a boatload of symbols more quickly Paul F. Dietz
Re: reloading a boatload of symbols more quickly Henrik Motakef
Re: reloading a boatload of symbols more quickly Tim Bradshaw
Re: reloading a boatload of symbols more quickly Joe Marshall

From: Paul F. Dietz
Subject: Re: reloading a boatload of symbols more quickly
Date: Wed, 19 Mar 2003 00:05:17 +0000
Message-ID: <mWCdnWgKYcAdLuqjXTWcqA@dls.net>

David J Cooper Jr wrote:

>  I write the hash tables out as simply a file of pairs containing
>  symbols and integers, e.g:
> 
>    ...
>    (SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1) 
>    (SQUEAKYMAIL-TOKENS::|Received*617166;| . 1) 
>    (SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1) 
>    (SQUEAKYMAIL-TOKENS::|kicks| . 1) 
>    (SQUEAKYMAIL-TOKENS::|bills| . 26) 
>    (SQUEAKYMAIL-TOKENS::|Bills| . 8) 
>    ...
> 
>  And rebuild the hash tables by reading pairs from this file and
>  setf'ing the corresponding hash table entries until hitting
>  end-of-file.

You might consider writing and reading the file with *PACKAGE*
bound to (find-package "SQUEAKYNMAIL-TOKENS").  No sense forcing
that package to be found again (or, for that matter, the characters
SQUEAKYMAIL-TOKENS:: to be read) for every symbol.

	Paul

From: Henrik Motakef
Subject: Re: reloading a boatload of symbols more quickly
Date: Wed, 19 Mar 2003 00:08:10 +0000
Message-ID: <87k7ewp6ad.fsf@interim.henrik-motakef.de>

David J Cooper Jr <·······@genworks.com> writes:

>  I write the hash tables out as simply a file of pairs containing
>  symbols and integers, e.g:
> 
>    ...
>    (SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1) 
>    (SQUEAKYMAIL-TOKENS::|Received*617166;| . 1) 
>    (SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1) 
>    (SQUEAKYMAIL-TOKENS::|kicks| . 1) 
>    (SQUEAKYMAIL-TOKENS::|bills| . 26) 
>    (SQUEAKYMAIL-TOKENS::|Bills| . 8) 
>    ...

This may or may not solve your original problem, but why don't you
just use the compiler to deal with saving the hash table?

;; session one

(defvar *hash-table* (make-hash-table))
(setf (gethash 'squeakymail-tokens::|Subject*CASE| *hash-table*) 1)
(setf (gethash 'squeakymail-tokens::|Recieved*617166;| *hash-table*) 1)

(with-open-file (s "hashtable-reloader.lisp" :direction :output)
  (write-string "(defparameter *hash-table* #.*hash-table*)" s))

(compile-file "hashtable-reloader.lisp")

;; session two, lisp restarted in the mean time

(load "hashtable-reloader.fasl") ;; or whatever file type
(gethash 'squeakymail-tokens::|Subject*CASE| *hash-table*)
  => 1


Extra points for dealing with packages properly.

Regards
Henrik

From: Tim Bradshaw
Subject: Re: reloading a boatload of symbols more quickly
Date: Wed, 19 Mar 2003 08:43:24 +0000
Message-ID: <ey34r5z21cj.fsf@cley.com>

* David J Cooper, wrote:

>  So my basic question is, is there a way to make all this reading of
>  symbols go more quickly, by preparing the package ahead of time, or
>  helping the reader along somehow? I am already defining the package
>  with a size, as in 

One thing might be to write out a FASL file (or a source file which
you then compile).  It may be that the interning process is faster
then.  

...

>  I suppose a higher-level question is, is this even an appropriate use
>  of symbols and packages? According to Paul Graham's examples and the
>  Henley example from ANSI_Common_Lisp, it would appear that it is
>  appropriate.

...

And another thing might be not to do that.  If you use strings and
intern them in an EQUAL hash table then you'll avoid the overhead
(both time and space) of interning them in a package - essentially
this is designing your own hopefully-lightweight package system.  The
problem is that interning in an EQUAL hashtable might take much longer
than interning in an EQL one, so you may lose anyway.

Alternatives might include using some sort of string-tree (trie)
structure which you would have to implement.  Done right these can ba
fast and compact I think (since all the prefixes are shared
structure).

--tim

From: Joe Marshall
Subject: Re: reloading a boatload of symbols more quickly
Date: Wed, 19 Mar 2003 14:55:08 +0000
Message-ID: <isufqucz.fsf@ccs.neu.edu>

David J Cooper Jr <·······@genworks.com> writes:

>  But periodically I have to save each (user-specific) token hash table
>  to disk, so they can be reloaded, and the hash tables restored, when
>  the filter process is rebooted. What appears to take most of the time
>  in this reloading is not really the reading, and not the populating
>  of the hash table itself, but re-interning the symbols into the token
>  package (a special package I have set up to contain all the symbols
>  being used as classification tokens).

Yet another option (the others mentioned seem good to me) might be to
dump the entire image periodically.  Whether this is a good idea
depends on how often you dump the tables, how often you need to reload
and the pain associated with reloading.