The lisp code for the word-frequency benchmark on the shootout is
giving bad results in the official tests:
http://shootout.alioth.debian.org/benchmark.php?test=wordfreq&lang=cmucl&id=0&sort=fullcpu
I believe this is because the same hashtable is being re-used in
successive runs. Note that in two cases the lisp code gave an answer
that was exactly 3 times the correct answer, suggesting that the code
ran 3 times, increasing the numbers on each run.
My fix is to add (clrhash *ht*) to the end of the main function:
(defun main ()
  (readit)
  (printit)
  (clrhash *ht*))
##########################################################
Before I submit this correction, does anyone have ideas for improving
the speed?
##########################################################
I have placed a copy of the lisp code at
plaza.ufl.edu/lavigne/word-frequency.lisp
which I will update with any suggestions C.L.L.ers come up with. I have
added a
(defun test....)
to the bottom which I found useful for testing. It just maps standard
input and output to files so I can run tests from the REPL. I will
remove that function before submitting.
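For reference, a minimal sketch of that kind of test wrapper (run-with-files is a hypothetical helper name; the real test function in the file may differ):

```lisp
(defun run-with-files (fn in-path out-path)
  ;; Rebind the standard streams to files so FN can be exercised from
  ;; the REPL the same way the shootout harness would run it.
  (with-open-file (in in-path :direction :input)
    (with-open-file (out out-path :direction :output
                         :if-exists :supersede)
      (let ((*standard-input* in)
            (*standard-output* out))
        (funcall fn)))))
```

Calling (run-with-files #'main "input.txt" "output.txt") then runs the benchmark against a file from the REPL.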
>Before I submit this correction, does anyone have
>ideas for improving the speed?
>I have placed a copy of the lisp code at
> plaza.ufl.edu/lavigne/word-frequency.lisp
>which I will update with any suggestions C.L.L.ers
>come up with.
Here are the profiling results:
total-run-time: 3.2-3.3 seconds
readit: 2.3 seconds
sort: 0.7 seconds
print: 0.15 seconds
The readit has two big operations, reading from the file and storing
results in a hashtable. When I remove the part that stores results,
readit finishes in only 0.01 seconds, so it seems that most time is
spent on adding results to the hashtable. Of course, if the compiler is
really smart then it noticed I wasn't doing anything with the stuff
being read in... and decided not to waste time reading it ^^
Eric Lavigne wrote:
> The lisp code for the word-frequency benchmark on the shootout is
> giving bad results in the official tests:
> http://shootout.alioth.debian.org/benchmark.php?test=wordfreq&lang=cmucl&id=0&sort=fullcpu
>
> I believe this is because the same hashtable is being re-used in
> successive runs. Note that in two cases the lisp code gave an answer
> that was exactly 3 times the correct answer, suggesting that the code
> ran 3 times, increasing the numbers on each run.
>
> My fix is to add (clrhash *ht*) to the end of the main function:
> (defun main ()
>   (readit)
>   (printit)
>   (clrhash *ht*))
Why not /start/ with clrhash? Or make a new hash table on each run?
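Either variant is a one-line change. A self-contained sketch of the clear-at-start pattern (count-words is a hypothetical stand-in for the benchmark's readit):

```lisp
(defparameter *ht* (make-hash-table :test 'equal))

(defun count-words (words)
  ;; Clearing at the start guarantees a clean slate even if a previous
  ;; run left entries behind in *HT*.
  (clrhash *ht*)
  (dolist (w words)
    (incf (gethash w *ht* 0)))
  (gethash "the" *ht*))

;; Running twice now gives the same answer both times instead of
;; doubled counts: (count-words '("the" "cat" "the")) => 2 each time.
```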
--
Kenny
Why Lisp? http://lisp.tech.coop/RtL%20Highlight%20Film
"If you plan to enter text which our system might consider to be
obscene, check here to certify that you are old enough to hear the
resulting output." -- Bell Labs text-to-speech interactive Web page
>> My fix is to add (clrhash *ht*) to the end of the
>> main function:
>> (defun main ()
>>   (readit)
>>   (printit)
>>   (clrhash *ht*))
>Why not /start/ with clrhash? Or make a new hash table on each run?
>
>Kenny
I have placed the 3 global variables (buffer, word, and ht) in a let
block so they are no longer global. add-word, readit, and printit are
now closures within that let block. ht is initialized at the beginning
of readit. clrhash is no longer needed in main.
I didn't notice any speed change (run time varies more between runs
than between program versions), but it does look nicer. I will post the
new version in a few minutes. As before, the latest version is kept at
plaza.ufl.edu/lavigne/word-frequency.lisp
>I didn't notice any speed change (run
>time varies more between runs than
>between program versions), but it does
>look nicer. I will post the new version in
>a few minutes. As before, the latest
>version is kept at
>plaza.ufl.edu/lavigne/word-frequency.lisp
I have to take that back. Using "let" instead of "defparameter" dropped
the speed by around 7%. This is very strange, since the logic didn't
change at all.
Anyway, (time (clrhash *ht*)) gives 0.0 seconds (when *ht* is full) so
moving it around doesn't seem like a big issue. I've put the code back
as it was (with clrhash at the end of main).
Eric Lavigne wrote:
>>I didn't notice any speed change (run
>>time varies more between runs than
>>between program versions), but it does
>>look nicer. I will post the new version in
>>a few minutes. As before, the latest
>>version is kept at
>>plaza.ufl.edu/lavigne/word-frequency.lisp
>
>
> I have to take that back. Using "let" instead of "defparameter" dropped
> the speed by around 7%. This is very strange, since the logic didn't
> change at all.
You can use let /and/ defparameter:
(let ((*ht* (make-hash-table ...))
      *word*
      *whatever*)
  ....)
Then you won't be messing with closures, which might explain the
performance hit.
>
> Anyway, (time (clrhash *ht*)) gives 0.0 seconds (when *ht* is full) so
> moving it around doesn't seem like a big issue. I've put the code back
> as it was (with clrhash at the end of main).
Was the problem that *ht* was not clear after the run, or that it was
populated when a run started? I gather the latter. The most reliable way
to make sure X is clear when a run starts is to clear X at the start of
the run. It also spares the reader the confusion of wondering why on
earth *ht* is being cleared as the last step.
my2.
--
Kenny
>Was the problem that *ht* was not clear after the run, or that it was
>populated when a run started? I gather the latter. The most reliable way
>to make sure X is clear when a run starts is to clear X at the start of
>the run. It also spares the reader the confusion of wondering why on
>earth *ht* is being cleared as the last step.
Makes sense. I also like having the hashtable still full after a run so
I can take a peek inside ^^ My reasoning was that it looked kinda
strange to clear a hashtable when it hasn't been filled yet (first
run). The main thing is to make sure it is cleared between runs, which
will happen in either case. Anyway, it is moved to the front for now.
Current versions:
plaza.ufl.edu/lavigne/word-frequency.lisp
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
(Second is more elegant but broken. How to
make the reader treat periods, commas, etc.
just like spaces?)
Eric Lavigne wrote:
>>Was the problem that *ht* was not clear after the run, or that it was
>>populated when a run started? I gather the latter. The most reliable way
>>to make sure X is clear when a run starts is to clear X at the start of
>>the run. It also spares the reader the confusion of wondering why on
>>earth *ht* is being cleared as the last step.
>
>
> Makes sense. I also like having the hashtable still full after a run so
> I can take a peek inside ^^
Yes, I almost mentioned that.
> My reasoning was that it looked kinda
> strange to clear a hashtable when it hasn't been filled yet (first
> run).
Agreed. But if I see a standalone chunk of code begin with (clrhash
*ht*), I know that somewhere there is a (defparameter *ht* (mht)) or
(setf *ht* (mht)), ie that there is a hash table object just sitting in
*ht*, and with that same realization comes understanding of the opening
(clrhash *ht*).
> The main thing is to make sure it is cleared between runs, which
> will happen in either case. Anyway, it is moved to the front for now.
Did you try the (let ((*ht* (mht))) ...) approach to see if that cured
the performance hit? Mind you, you then have to return *ht* so you can
inspect it.
--
Kenny
>Did you try the (let ((*ht* (mht))) ...) approach to see if that cured
>the performance hit? Mind you, you then have to return *ht* so you can
>inspect it.
Actually, I have no idea how to use (let ((*ht*... without creating
closures. If I put "let" inside of a function then it will be
reevaluated (emptied) every time the function runs. Also, there are
several functions that need access to that variable. Could you post an
example of what you're thinking so I can understand better?
Anyway, the whole point of a closure was to eliminate the
defparameters. I tend to avoid globals because it forces me to think
about the whole code at once (where else does this variable get
modified?). If I need to make the variable available to every function,
though, I might as well do it with a global. It bothers me a bit that
closures have worse performance than globals, just because I will
probably want to use them later, but it doesn't seem to be a big issue
on this project.
Eric Lavigne wrote:
>>Did you try the (let ((*ht* (mht))) ...) approach to see if that cured
>>the performance hit? Mind you, you then have to return *ht* so you can
>>inspect it.
>
>
> Actually, I have no idea how to use (let ((*ht*... without creating
> closures. If I put "let" inside of a function then it will be
> reevaluated (emptied) every time the function runs. Also, there are
> several functions that need access to that variable. Could you post an
> example of what you're thinking so I can understand better?
(defvar *ht*)

(defun main ()
  (let ((*ht* (make-hash-table ...)))
    (step-1)
    (step-2)
    *ht*))
Now all your code can see *ht* (and the other "globals" you had in
mind). I antlered globals because they work differently than, say, C
globals. For example, in the above example, the binding to *ht* goes
away when the let form is exited. Even better, if it was bound to
something else when the let was entered, it reverts to that binding. So
main will /return/ the hash table used during the run of main, but if
you inspect *ht* after running main you will find it to be nil.
Globals are a no-no in programming, but this "special" behavior rocks.
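A minimal demonstration of that reverting behavior (peek-ht and demo are hypothetical names, not from the benchmark code):

```lisp
(defvar *ht* nil)

(defun peek-ht ()
  ;; Sees whatever dynamic binding of *HT* is in effect in its caller.
  *ht*)

(defun demo ()
  (let ((*ht* (make-hash-table)))
    (hash-table-p (peek-ht))))  ; T: the callee sees the LET's binding

;; After DEMO returns, the binding has reverted:
;; (demo)    => T
;; (peek-ht) => NIL
```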
Back in the day programming Mac Quickdraw, one always had to:
GetPort( &savePort );
SetPort( <the port I want to draw to>);
draw draw draw
SetPort( savePort );
And, again, while globals generally bite the big one, in a case like
QuickDraw it gets old in a hurry forever passing something like the
GrafPort around. Every conceivable function in the imaging call tree has
to relay the frickin thing around.
In a case like Cells, where it is vital for every slot access to know
who is asking, only specials will serve.
>
> Anyway, the whole point of a closure was to eliminate the
> defparameters. I tend to avoid globals because it forces me to think
> about the whole code at once (where else does this variable get
> modified?). If I need to make the variable available to every function,
> though, I might as well do it with a global. It bothers me a bit that
> closures have worse performance than globals, just because I will
> probably want to use them later, but it doesn't seem to be a big issue
> on this project.
>
--
Kenny
>(defvar *ht*)
>
>(defun main ()
>  (let ((*ht* (make-hash-table ...)))
>    (step-1)
>    (step-2)
>    *ht*))
>Now all your code can see *ht* (and the other "globals" you had in
>mind). I antlered globals because they work differently than, say, C
>globals. For example, in the above example, the binding to *ht* goes
>away when the let form is exited. Even better, if it was bound to
>something else when the let was entered, it reverts to that binding. So
>main will /return/ the hash table used during the run of main, but if
>you inspect *ht* after running main you will find it to be nil.
>
>Globals are a no-no in programming, but this "special" behavior rocks.
Hehe, it's hard to believe I could forget this "special" behavior. The
first time I asked for help on C.L.L. was because this "special"
behavior had broken the encapsulation on all my functions and filled my
code with bugs. I guess after getting bit so hard it didn't occur to me
that this behavior could be useful.
I'm taking a jogging break now, but I'll be back to try this out in a
bit.
>(defvar *ht*)
>(defun main ()
>  (let ((*ht* (make-hash-table ...)))
>    (step-1)
>    (step-2)
>    *ht*))
There is no (noticeable) performance penalty for doing this. In 3 out
of 5 trials, using let here was faster than using clrhash. I am making
this the new version.
plaza.ufl.edu/lavigne/word-frequency.lisp
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
Eric Lavigne wrote:
>>(defvar *ht*)
>
>
>>(defun main ()
>>  (let ((*ht* (make-hash-table ...)))
>>    (step-1)
>>    (step-2)
>>    *ht*))
>
>
> There is no (noticeable) performance penalty for doing this. In 3 out
> of 5 trials, using let here was faster than using clrhash. I am making
> this the new version.
>
> plaza.ufl.edu/lavigne/word-frequency.lisp
> plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
>
> (defun main ()
>   (let ((*ht* (make-hash-table :size 60000 :test 'equal)))
Note that now the defparameter form should not bind a hash table, since
it will get supplanted on every run.
>     (readit)
>     (printit)
>     *ht*))
>
--
Kenny
>Note that now the defparameter form should not bind
>a hash table, since it will get supplanted on every run.
defvar then? My understanding of the difference is that defparameter
only creates new globals, whereas defvar can also change the value of
existing variables. Either seems to work here. Maybe the idea is that
defparameter implies a constant?
Eric Lavigne wrote:
>>Note that now the defparameter form should not bind
>>a hash table, since it will get supplanted on every run.
>
>
> defvar then? My understanding of the difference is that defparameter
> only creates new globals, whereas defvar can also change the value of
> existing variables.
defparameter is the one that applies a changed binding each time it is
evaluated; defvar does so only for new (unbound) specials.
> Either seems to work here.
yep.
> Maybe the idea is that
> [defvar] implies a constant?
I do not know how the two variants arose or the fine details of the
difference. I would use:
(defvar *long-lived-dictionary* (make-hash-table))
...so I could work on and periodically recompile/load the whole source
file without losing the stuff in *long-lived-dictionary* (a different
situation than yours) and defparameter where I wanted it reset each time
I reloaded.
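The distinction in miniature (the *counter* variable is hypothetical, just for illustration):

```lisp
(defvar *counter* 0)        ; initializes only if *COUNTER* is unbound
(incf *counter*)            ; now 1
(defvar *counter* 0)        ; re-evaluating DEFVAR leaves the 1 alone
(defparameter *counter* 0)  ; DEFPARAMETER resets it to 0 every time
```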
--
Kenny
>I do not know how the two variants arose or the fine details of the
>difference. I would use:
> (defvar *long-lived-dictionary* (make-hash-table))
>...so I could work on and periodically recompile/load the whole source
>file without losing the stuff in *long-lived-dictionary* (a different
>situation than yours) and defparameter where I wanted it reset each time
>I reloaded.
Change made. It looks like Nicolas's code will win this match, though,
as it is shorter and faster.
plaza.ufl.edu/lavigne/word-frequency.lisp
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
"Eric Lavigne" <············@gmail.com> writes:
> ##########################################################
> Before I submit this correction, does anyone have ideas for improving
> the speed?
> ##########################################################
I guess that the code could be made much shorter and faster by using the CL
reader for reading the words of the text and working with symbols instead.
Nicolas.
>I guess that the code could be made much shorter and faster by using the CL
>reader for reading the words of the text and working with symbols instead.
>Nicolas.
The example text contains parentheses (11 pairs), so a symbolic version
would have to process lists on occasion.
Punctuation is also an issue, since "cat" and "cat." need to be
considered the same symbol.
Are symbols really faster? I would expect symbol comparison to be
faster than string comparison, but the reader still needs to convert a
set of characters to a symbol. It probably uses a hash table for that,
leading to the same performance problem as the number of symbols
becomes very high.
Probably I am misunderstanding something here. I am relatively new to
both lisp and speed tuning.
________________
plaza.ufl.edu/lavigne/word-frequency.lisp
"Eric Lavigne" <············@gmail.com> writes:
> Are symbols really faster? I would expect symbol comparison to be
> faster than string comparison, but the reader still needs to convert a
> set of characters to a symbol. It probably uses a hash table for that,
> leading to the same performance problem as the number of symbols
> becomes very high.
Well, it really depends.
Some implementations have specialized hash tables for strings that are
only available to the implementation. Since they have a more limited
set of inputs to the hash function, they can be tuned for better
performance. Since reading and interning symbols is fairly fundamental
it often gets a lot of attention in implementations.
Weighing against that is the need to create symbols, which are fairly
substantial objects in their own right. In addition to having the name
string stored, there are various slots for package, function, value,
etc. that need to be created. This will add some overhead to the
process of making the objects that are then used for the comparison.
--
Thomas A. Russ, USC/Information Sciences Institute
Here is a try. But I do not know how to make special characters be
interpreted as whitespace. Who can help?
Nicolas.
(defun wordcount (stream)
  (let ((*readtable* (copy-readtable))
        (table (make-hash-table :test 'eq)))
    (setf (readtable-case *readtable*) :downcase)
    #+(or)(loop for char in '(#\. #\; #\, #\' #\" #\:) do
            (set-macro-character char nil nil)) ; does not work for me
    (loop for word = (read stream nil) while word
          do (incf (gethash word table 0)))
    (loop for word being each hash-key of table
          and count being each hash-value of table
          do (format t "~A : ~A~%" word count))))
Maybe a preprocessing step instead?
1) Use read-line to fill a string.
2) Strip special characters out of the string.
(remove-if-not #'alpha-or-space string)
3) Read from that string.
(read-from-string string)
It's more steps, but it allows read to be used in the normal way
(assuming read-from-string has similar behavior as read). One annoying
issue with read-from-string is that you have to keep track of an index
(where you are in the string), so multiple-value-bind is needed, so
read-from-string can't be nested in the function that uses its return
value... gets a bit messy. I would much prefer a destructive version of
read-from-string that just destroyed the part it had already read.
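The index bookkeeping looks roughly like this (read-words is a hypothetical helper; the key points are the :start keyword and read-from-string's second return value):

```lisp
(defun read-words (string)
  ;; Read successive forms from STRING, threading the index by hand
  ;; via READ-FROM-STRING's second return value.
  (loop with pos = 0
        for (form next) = (multiple-value-list
                           (read-from-string string nil nil :start pos))
        while form
        do (setf pos next)
        collect form))
```

So (read-words "apple bat cat") collects the three symbols, and the caller never sees the index juggling.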
>Here is a try. But I do not know how to make special characters be
>interpreted as whitespace. Who can help?
>Nicolas.
>(defun wordcount (stream)
>  (let ((*readtable* (copy-readtable))
>        (table (make-hash-table :test 'eq)))
>    (setf (readtable-case *readtable*) :downcase)
>    #+(or)(loop for char in '(#\. #\; #\, #\' #\" #\:) do
>            (set-macro-character char nil nil)) ; does not work for me
>    (loop for word = (read stream nil) while word
>          do (incf (gethash word table 0)))
>    (loop for word being each hash-key of table
>          and count being each hash-value of table
>          do (format t "~A : ~A~%" word count))))
On 5 Jun 2005 19:04:03 -0700, <············@gmail.com> wrote:
>
>
> Maybe a preprocessing step instead?
> 1) Use readline to fill a string.
> 2) Strip special characters out of the string.
> (remove-if-not #'alpha-or-space string)
> 3) Read from that string.
> (read-from-string string)
>
> It's more steps, but it allows read to be used in the normal way
> (assuming read-from-string has similar behavior as read).
I haven't been following this at all, but since the read issue has
surfaced a couple of times, I point out:
http://www.emmett.ca/~sabetts/slurp.html
--
With sufficient thrust, pigs fly fine.
GP lisper <········@CloudDancer.com> writes:
> I haven't been following this at all, but since the read issue has
> surfaced a couple of times, I point out:
>
> http://www.emmett.ca/~sabetts/slurp.html
Urgh. See <http://www.tfeb.org/lisp/obscurities.html>
'as
On Mon, 06 Jun 2005 17:13:10 +0100, <··········@gmx.net> wrote:
> GP lisper <········@CloudDancer.com> writes:
>
>> I haven't been following this at all, but since the read issue has
>> surfaced a couple of times, I point out:
>>
>> http://www.emmett.ca/~sabetts/slurp.html
>
> Urgh. See <http://www.tfeb.org/lisp/obscurities.html>
Hmm, depends on the data in the file. The only "obscure bug" occurs
when the author fails to test his file slurper with the real data (and
that bug is between the ears). The joy in Lisp is small testable
functions that are correct every step on the way to the complete
program.
The only bug I see here is the insistence on translating the stream as
it is read and the confusion between the file-reading-counter and the
stream-symbol-counter.
Turned out to be easy. Here is the complete version, which should fit the
requirements of the shootout. And as far as I can see it is very fast.
Nicolas.
(defun wordcount (&optional (stream *standard-input*)
                  &aux (*readtable* (copy-readtable))
                       (table (make-hash-table)))
  ;; tweak readtable
  (loop for char across "\".;,#:()[]{}" do
        (set-syntax-from-char char #\Space))
  ;; count
  (loop for word = (read stream nil #\.) until (eq word #\.)
        do (incf (gethash word table 0)))
  ;; output
  (let ((*print-pretty* nil))
    (loop for (word . count) in
          (sort (loop for a being the hash-keys of table using (hash-value b)
                      collect (cons a b))
                #'(lambda (a b)
                    (or (> (cdr a) (cdr b))
                        (string<= (car a) (car b)))))
          do (format t "~D : ~A~%" count (string-downcase word)))))
;;; Testing:
(wordcount (make-string-input-stream "A b a hello.B, a Hello b"))
I have added a more complete test function and increased the number of
characters that need to be treated as spaces. I will also change the
format statement to match the benchmark requirements. There is still an
issue left, though. Entries must be sorted, primarily by number of
occurrences and secondarily by alphabet as shown on this page:
http://shootout.alioth.debian.org/iofile.php?test=wordfreq&lang=all&sort=fullcpu&file=output
So far this version is almost 200% faster than the previous version,
but sorting will probably reduce that significantly. Maybe 50% faster?
Still a very nice improvement.
plaza.ufl.edu/lavigne/word-frequency.lisp
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
>Entries must be sorted, primarily by number of
>occurrences and secondarily by alphabet as shown
>on this page:
>http://shootout.alioth.debian.org/iofile.php?test=wordfreq&lang=all&s...
>So far this version is almost 200% faster than
>the previous version, but sorting will probably
>reduce that significantly. Maybe 50% faster?
>Still a very nice improvement.
>plaza.ufl.edu/lavigne/word-frequency.lisp
>plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
I was wrong about the sorting. Nicolas already performed a sort, just
not correctly. I fixed the mistake, and it is still just as fast. The
only problem I see now is the formatting. The numbers need to be right
justified in a column as on this page:
http://shootout.alioth.debian.org/iofile.php?test=wordfreq&lang=all&sort=fullcpu&file=output
Here is the format statement that *should* take care of that:
(format t "~7D : ~A~%" count (string-downcase word))
It doesn't work, though. It works correctly when I test this line alone
(substituting a number and a string instead of count and
(string-downcase word)) but when running the entire code all of the
numbers are flush against the left edge.
The latest version is at
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
Anyone see how to fix it?
"Eric Lavigne" <············@gmail.com> writes:
> I was wrong about the sorting. Nicolas already performed a sort, just
> not correctly. I fixed the mistake, and it is still just as fast.
This I get for insufficient testing:-) Good that you fixed it.
Mentally, I would slightly prefer the test
(or (> (cdr a) (cdr b))
    (and (= (cdr a) (cdr b))
         (string>= (car a) (car b))))
which should also be slightly more performant, but it does not really
matter.
> The only problem I see now is the formatting. The numbers need to be
> right justified in a column as on this page: ...
>
> Here is the format statement that *should* take care of that:
> (format t "~7D : ~A~%" count (string-downcase word))
Maybe a bug in CMUCL/SBCL? CLISP gives the right output.
Nicolas.
>Mentally, I would slightly prefer the test
> (or (> (cdr a) (cdr b))
>     (and (= (cdr a) (cdr b))
>          (string>= (car a) (car b))))
>which should also be slightly more performant, but it does not really
>matter.
I will give this a try.
>> The only problem I see now is the formatting. The numbers need to be
>> right justified in a column as on this page: ...
>> Here is the format statement that *should* take care of that:
>> (format t "~7D : ~A~%" count (string-downcase word))
>Maybe a bug in CMUCL/SBCL? CLISP gives the right output.
The shootout uses CMUCL, so it needs to work on CMUCL.
When I tested (format t "~7d ~a~%" 100 "cat") I got the right output.
Somehow this is different from when a similar line appears within a
larger function...
>>> The only problem I see now is the formatting. The numbers need to be
>>> right justified in a column as on this page: ...
>>> Here is the format statement that *should* take care of that:
>>> (format t "~7D : ~A~%" count (string-downcase word))
>>Maybe a bug in CMUCL/SBCL? CLISP gives the right output.
>The shootout uses CMUCL, so it needs to work on CMUCL.
>When I tested (format t "~7d ~a~%" 100 "cat") I got the right output.
>Somehow this is different from when a similar line appears within a
>larger function...
I found the problem, but my fix was a bit (a lot) kludgey. The reader
needs to ignore numbers while processing our input file. If I tell it
to ignore all numbers *except 7* then this will work:
(format t "~7d ~a~%" count (string-downcase word))
So now the code gives perfect results on the example text. Just two
problems...
1) I got lucky that the example text didn't include the number 7. If
they ever change the example text...
2) Anyone who reads this code will wonder why 7 is special...
I think the right way is to change the code so that the new readtable
is only in effect while reading input, not while reading our own code.
I have not done this yet. The new code takes 1.2 seconds, compared to
3.2 seconds with the other version we were working on.
Latest version is up at plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
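One way to scope the readtable change to input reading only (a sketch: this standalone readit variant and its character list are assumptions modeled on Nicolas's code, not the submitted version):

```lisp
(defun readit (stream table)
  ;; Bind the tweaked readtable only around the input loop, so the rest
  ;; of the source file is read with the standard readtable and FORMAT
  ;; strings like "~7d" are untouched.
  (let ((*readtable* (copy-readtable)))
    (loop for char across ".;,:\"()[]{}" do
          (set-syntax-from-char char #\Space))
    (loop for word = (read stream nil nil) while word
          do (incf (gethash word table 0)))))
```

Because the binding is dynamic, it unwinds automatically when readit returns.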
>(defun wordcount (&optional (stream *standard-input*)
>                  &aux (new-readtable (copy-readtable))
>                       (table (make-hash-table)))
>  (let ((*readtable* new-readtable))
Above is how the code looks now. What do you think of changing to this:
(defun wordcount (&optional (stream *standard-input*)
                  &aux (table (make-hash-table)))
  (let ((*readtable* (copy-readtable)))
It is one line shorter, but it doesn't allow the user to specify the
readtable.
The current version is at
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
"Eric Lavigne" <············@gmail.com> writes:
> It is one line shorter, but it doesn't allow the user to specify the
> readtable.
This is also not possible with &aux. &aux is simply an abbreviation for a
let* wrapped around the function body. The way I wrote it earlier is the
shortest. Or maybe even (saving another line):
(defun wordcount (&optional (stream *standard-input*)
                  &aux (*print-pretty* nil) (table (make-hash-table :test 'eq)))
Bye, Nicolas.
P.S.: That the readtable changes matter for FORMAT is quite interesting
(and should not happen IMO). I hope that the experts are reading this
thread. Otherwise one could discuss it also as a possible bug on the CMUCL
mailing list.
"Eric Lavigne" <············@gmail.com> writes:
> I have added a more complete test function and increased the number of
> characters that need to be treated as spaces. I will also change the
> format statement to match the benchmark requirements. There is still an
> issue left, though. Entries must be sorted, primarily by number of
> occurrences and secondarily by alphabet as shown on this page:
But my function should do this already!
Further, if you submit this, one can simply put code like
(let ((table (make-hash-table)))
  ...
  (read *standard-input* ...)
  ...)
into the file (no function definition at all!). This should make it
shorter than almost any other language (while still being very fast).
Nicolas.
>This should make it
>shorter than almost any other
>language (while still being very fast)
Yes, it is very short and very fast. It is 2-3 times faster than the
other code we were working on. It still has a problem, though, because
the printing format is not quite right. See below.
>But my function should do this already!
>...
>Nicolas.
I don't know what *this* refers to, so I will take the possibilities
one at a time.
1) more complete test function
Your way of testing was to pass a string. My test function passes a
file. The test function will not be part of the final submission. It
only makes it easier for me to test your code. This is important,
because my tests uncovered mistakes.
2) Increased the number of characters treated as spaces
It now ignores numbers, dashes, single quotes, asterisks... This change
is necessary. If you don't think so, try running the old version with
this test string:
"Nicolas's code crashes when it sees a single quote followed by a
letter."
3) Entries must be sorted
I noticed that the entries weren't sorted properly, so I just assumed
they weren't sorted. This was wrong. Actually, it seems that you sorted
them alphabetically. To fix this I changed
#'(lambda (a b)
    (or (> (cdr a) (cdr b))
        (string<= (car a) (car b)))))
to
#'(lambda (a b)
    (if (eql (cdr a) (cdr b))
        (string>= (car a) (car b))
        (> (cdr a) (cdr b)))))
I don't know why your sorting function doesn't work, but on my computer
(CMUCL, gentoo linux) it sorts by alphabet only.
4) Printing format
This should be simple, but it still needs to be done right. I have
changed your format to *almost* match the required format, but there is
still a problem.
(format t "~7D : ~A~%" count (string-downcase word))
The above format statement should print " 72 apple" but instead it
is printing "72 apple"
I need those spaces. Numbers should be right justified within a column
as on this page:
http://shootout.alioth.debian.org/iofile.php?test=wordfreq&lang=all&sort=fullcpu&file=output
The only problem still remaining is #4, formatted printing. It looks
correct to me in the code, but it doesn't print correctly.
My way of testing (using a test function that will be removed later) is
(time (test))
Latest version is here: plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
"Eric Lavigne" <············@gmail.com> writes:
> The lisp code for the word-frequency benchmark on the shootout is
> giving bad results in the official tests:
> http://shootout.alioth.debian.org/benchmark.php?test=wordfreq&lang=cmucl&id=0&sort=fullcpu
>
> I believe this is because the same hashtable is being re-used in
> successive runs. Note that in two cases the lisp code gave an answer
> that was exactly 3 times the correct answer, suggesting that the code
> ran 3 times, increasing the numbers on each run.
>
> My fix is to add (clrhash *ht*) to the end of the main function:
> (defun main ()
>   (readit)
>   (printit)
>   (clrhash *ht*))
>
> ##########################################################
> Before I submit this correction, does anyone have ideas for improving
> the speed?
> ##########################################################
Binding *PRINT-PRETTY* to nil in printit gives a 25% speedup in CMUCL.
Andras
>Binding *PRINT-PRETTY* to nil in printit gives a 25% speedup in CMUCL.
>Andras
Only a 5% speedup on my CMUCL, but this is still a very interesting
number. Printing was taking up 5% of the total runtime before this.
Apparently all of that time was spent on "prettiness" that had no
effect on appearance. I've uploaded the new version to
plaza.ufl.edu/lavigne/word-frequency.lisp
I am putting Nicolas' version at
plaza.ufl.edu/lavigne/word-frequency_neuss.lisp
though it isn't working yet.
"Eric Lavigne" <············@gmail.com> writes:
> I believe this is because the same hashtable is being re-used in
> successive runs. Note that in two cases the lisp code gave an answer
> that was exactly 3 times the correct answer, suggesting that the code
> ran 3 times, increasing the numbers on each run.
>
> My fix is to add (clrhash *ht*) to the end of the main function:
> (defun main ()
>   (readit)
>   (printit)
>   (clrhash *ht*))
Well, depending on the Lisp implementation and various other factors, it
is sometimes faster to create a new hash table rather than clearing an
existing one.
Another tuning issue is to try to get a good estimate up-front for the
likely required size of the hash table. Re-hashing during hashtable
growth can be a significant source of slowdowns, since you end up with
something like n^2 performance on number of hashes required to populate
the table. (N + 1/2N + 1/4N ....)
--
Thomas A. Russ, USC/Information Sciences Institute
···@sevak.isi.edu (Thomas A. Russ) writes:
> Another tuning issue is to try to get a good estimate up-front for the
> likely required size of the hash table. Re-hashing during hashtable
> growth can be a significant source of slowdowns, since you end up with
> something like n^2 performance on number of hashes required to populate
> the table. (N + 1/2N + 1/4N ....)
This is not n^2 but 2*n.
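Right: with doubling growth the final rehash copies about n elements, the one before it n/2, and so on, and n + n/2 + n/4 + ... < 2n. A quick numeric check (rehash-copy-count is a hypothetical model, assuming doubling from a small initial size):

```lisp
(defun rehash-copy-count (n &optional (size 16))
  ;; Model the element copies forced by doubling growth while
  ;; inserting N items one at a time.
  (let ((copies 0))
    (dotimes (i n copies)
      (when (> (1+ i) size)       ; table would overflow: double and copy
        (incf copies size)
        (setf size (* 2 size))))))

;; (rehash-copy-count 60000) stays well under 2 * 60000.
```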
--
__("< Marcin Kowalczyk
\__/ ······@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Marcin 'Qrczak' Kowalczyk <······@knm.org.pl> writes:
>
> ···@sevak.isi.edu (Thomas A. Russ) writes:
>
> > Another tuning issue is to try to get a good estimate up-front for the
> > likely required size of the hash table. Re-hashing during hashtable
> > growth can be a significant source of slowdowns, since you end up with
> > something like n^2 performance on number of hashes required to populate
> > the table. (N + 1/2N + 1/4N ....)
>
> This is not n^2 but 2*n.
Thanks. It didn't quite feel right. I must have confused this with the
series N + (N-1) + (N-2) ....
--
Thomas A. Russ, USC/Information Sciences Institute