Hi,
I am looking for a fast way to reload a whole boatload of symbols (in
the range of 500,000 of them) into a hash table. These symbols are
built up as tokens from scanning emails for a Paul Graham "Plan for
Spam"-style email classifier I am doing, called Squeakymail, for
which I expect to release the source Real Soon Now.
While the filter is running, the tables of good and bad tokens get
built up incrementally, which is not a performance problem.
But periodically I have to save each (user-specific) token hash table
to disk, so they can be reloaded, and the hash tables restored, when
the filter process is rebooted. What appears to take most of the time
in this reloading is not really the reading, and not the populating
of the hash table itself, but re-interning the symbols into the token
package (a special package I have set up to contain all the symbols
being used as classification tokens).
I write the hash tables out as simply a file of pairs containing
symbols and integers, e.g:
...
(SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1)
(SQUEAKYMAIL-TOKENS::|Received*617166;| . 1)
(SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1)
(SQUEAKYMAIL-TOKENS::|kicks| . 1)
(SQUEAKYMAIL-TOKENS::|bills| . 26)
(SQUEAKYMAIL-TOKENS::|Bills| . 8)
...
And rebuild the hash tables by reading pairs from this file and
setf'ing the corresponding hash table entries until hitting
end-of-file.
From simply reading this huge file containing pairs of symbols and
integers, The Allegro CL profiler (prof:show-flat-profile) reveals
the following:
% % self total self total Function
Time Cum. secs secs calls ms/call ms/call name
36.2 36.2 178.1 178.2 EXCL::DO-REHASH
33.1 69.3 162.5 162.5 EXCL::GETHASH-STRING
...
which says that most of the time goes into hashing based on symbol
names and rehashing the package's symbol-table. (This is from simply
reading all the pairs from the file and having the reader do its
interning, _not_ actually building up my own hash tables --- building
up my own hash tables once the symbols are read adds only slightly to
the overall time).
Re-reading the same huge file in the same running CL session runs in
about 1/20 the time, and the profiler reveals almost no calls to
EXCL::DO-REHASH or EXCL::GETHASH-STRING, presumably because all these
symbols are already interned in the :squeakymail-tokens package.
So my basic question is, is there a way to make all this reading of
symbols go more quickly, by preparing the package ahead of time, or
helping the reader along somehow? I am already defining the package
with a size, as in
(defpackage :squeakymail-tokens (:use) (:size 500000))
I suppose a higher-level question is, is this even an appropriate use
of symbols and packages? According to Paul Graham's examples and the
Henley example from ANSI_Common_Lisp, it would appear that it is
appropriate.
Thanks,
-dave
David J Cooper Jr, Genworks International
From: Paul F. Dietz
Subject: Re: reloading a boatload of symbols more quickly
Date:
Message-ID: <mWCdnWgKYcAdLuqjXTWcqA@dls.net>
David J Cooper Jr wrote:
> I write the hash tables out as simply a file of pairs containing
> symbols and integers, e.g:
>
> ...
> (SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1)
> (SQUEAKYMAIL-TOKENS::|Received*617166;| . 1)
> (SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1)
> (SQUEAKYMAIL-TOKENS::|kicks| . 1)
> (SQUEAKYMAIL-TOKENS::|bills| . 26)
> (SQUEAKYMAIL-TOKENS::|Bills| . 8)
> ...
>
> And rebuild the hash tables by reading pairs from this file and
> setf'ing the corresponding hash table entries until hitting
> end-of-file.
You might consider writing and reading the file with *PACKAGE*
bound to (find-package "SQUEAKYNMAIL-TOKENS"). No sense forcing
that package to be found again (or, for that matter, the characters
SQUEAKYMAIL-TOKENS:: to be read) for every symbol.
Paul
David J Cooper Jr <·······@genworks.com> writes:
> I write the hash tables out as simply a file of pairs containing
> symbols and integers, e.g:
>
> ...
> (SQUEAKYMAIL-TOKENS::|Subject*CASE| . 1)
> (SQUEAKYMAIL-TOKENS::|Received*617166;| . 1)
> (SQUEAKYMAIL-TOKENS::|ticking!<br>| . 1)
> (SQUEAKYMAIL-TOKENS::|kicks| . 1)
> (SQUEAKYMAIL-TOKENS::|bills| . 26)
> (SQUEAKYMAIL-TOKENS::|Bills| . 8)
> ...
This may or may not solve your original problem, but why don't you
just use the compiler to deal with saving the hash table?
;; session one
(defvar *hash-table* (make-hash-table))
(setf (gethash 'squeakymail-tokens::|Subject*CASE| *hash-table*) 1)
(setf (gethash 'squeakymail-tokens::|Recieved*617166;| *hash-table*) 1)
(with-open-file (s "hashtable-reloader.lisp" :direction :output)
(write-string "(defparameter *hash-table* #.*hash-table*)" s))
(compile-file "hashtable-reloader.lisp")
;; session two, lisp restarted in the mean time
(load "hashtable-reloader.fasl") ;; or whatever file type
(gethash 'squeakymail-tokens::|Subject*CASE| *hash-table*)
=> 1
Extra points for dealing with packages properly.
Regards
Henrik
From: Tim Bradshaw
Subject: Re: reloading a boatload of symbols more quickly
Date:
Message-ID: <ey34r5z21cj.fsf@cley.com>
* David J Cooper, wrote:
> So my basic question is, is there a way to make all this reading of
> symbols go more quickly, by preparing the package ahead of time, or
> helping the reader along somehow? I am already defining the package
> with a size, as in
One thing might be to write out a FASL file (or a source file which
you then compile). It may be that the interning process is faster
then.
...
> I suppose a higher-level question is, is this even an appropriate use
> of symbols and packages? According to Paul Graham's examples and the
> Henley example from ANSI_Common_Lisp, it would appear that it is
> appropriate.
...
And another thing might be not to do that. If you use strings and
intern them in an EQUAL hash table then you'll avoid the overhead
(both time and space) of interning them in a package - essentially
this is designing your own hopefully-lightweight package system. The
problem is that interning in an EQUAL hashtable might take much longer
than interning in an EQL one, so you may lose anyway.
Alternatives might include using some sort of string-tree (trie)
structure which you would have to implement. Done right these can ba
fast and compact I think (since all the prefixes are shared
structure).
--tim
From: Joe Marshall
Subject: Re: reloading a boatload of symbols more quickly
Date:
Message-ID: <isufqucz.fsf@ccs.neu.edu>
David J Cooper Jr <·······@genworks.com> writes:
> But periodically I have to save each (user-specific) token hash table
> to disk, so they can be reloaded, and the hash tables restored, when
> the filter process is rebooted. What appears to take most of the time
> in this reloading is not really the reading, and not the populating
> of the hash table itself, but re-interning the symbols into the token
> package (a special package I have set up to contain all the symbols
> being used as classification tokens).
Yet another option (the others mentioned seem good to me) might be to
dump the entire image periodically. Whether this is a good idea
depends on how often you dump the tables, how often you need to reload
and the pain associated with reloading.