A simple Lisp program -- algorithm and speed issues

From: Hrvoje Niksic
Subject: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <kigpvx76q2k.fsf@jagor.srce.hr>

Here is an interesting, not entirely academic problem that me and a
colleague are "wrestling" with.  Say there is a file, containing
entries like this:

foo 5
bar 20
baz 4
foo 6
foobar 23
foobar 3
...

There are a lot of lines in the file (~10000), but many of the words
repeat (there are ~500 unique words).  We have endeavored to write a
program that would sum the occurences of each word, and display them
sorted alphabetically, e.g.:

bar 20
baz 4
foo 11
foobar 26
...

We meant to benchmark the program, to see how various algorithms and
languages worked, with the special accent to Lisp.  The program is
supposed to be a small model of a real-life example.  The results were
following:

1) C, using the sorted binary tree to store the values, and print them
   in order, recursively.  It took about 1s (to 1.5s) to process the
   file.

2) Lisp, using the binary tree approach.  The binary tree was
   implemented with each node of the tree having its CAR a list
   '("string" number), and CDR being a cons of left and right
   pointers.  This took about 10s to complete (Allegro CL).

3) Lisp, consing a simple associated list for each unique element, and
   updating the elements using (setf (assoc ...)).  The list is sorted
   in the end.  This took some 16s to complete (Allegro; CMU CL and
   GNU CL were much slower, about 32s), with the sorting part taking
   no perceivable amount of time.

4) Lisp, using a hash-table, and sorting it in the end
   (alphabetically).  This took 5s to complete, sorting taking no
   perceivable amount of time (again).

5) Perl, using its internal hash-tables.  The Perl code looked
   something like:

	  while (<>) {
	     ($a, $b) = split;
	     $x{$a} += $b;
	  }
	  foreach $_ (sort keys %x) {
	     print "$_ $x{$_}\n";
	  }

   It took 5s to complete.

Notes: without any processing the Lisp reads the lines in some 0.8
seconds.  All the additional time spent was due to processing.  All
the Lisp functions were compiled using `compile-file' followed by
`load', and the times were obtained via `(time (test-fun))' (no-gc
user-time was used).  The exception are the Perl and C times which
were obtained through `time' shell builtin.  The tests were performed
on a 486/66, without much swapping taking place.  The test results
times were mostly repeatable.


My questions are:

a) What is the "correct" algorithm to do it?  It seems to me that the
   tree approach should be much faster than the assoc approach.  But
   it isn't as much faster as I expected, probably because of not
   using the primitives like `assoc'.  What method would an
   experienced Lisp programmer use?

b) Why is the C implementation that much faster than either of the
   Lisp ones, even the hash-table approach?

c) Why is the Perl implementation as fast as the "best" Lisp one we
   could think of, although Lisp should be compiled into native code,
   while Perl is not.

d) Why does the Perl implementation seem to be the most elegant of
   all?  Perl's supposed to be write-only, dammit! :-)

-- 
Hrvoje Niksic <·······@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
WWW:          World-Wide-Waste.  Waste management corporation, which
              handles the billions of tons of garbage generated by just
              about everybody these days.

Re: A simple Lisp program -- algorithm and speed issues Barry Margolin
- Re: A simple Lisp program -- algorithm and speed issues Paul F. Dietz
Re: A simple Lisp program -- algorithm and speed issues Marco Antoniotti
- Re: A simple Lisp program -- algorithm and speed issues Hrvoje Niksic
Re: A simple Lisp program -- algorithm and speed issues William D Clinger
- Re: A simple Lisp program -- algorithm and speed issues Hrvoje Niksic
  - Re: A simple Lisp program -- algorithm and speed issues William D Clinger
    - Re: A simple Lisp program -- algorithm and speed issues Hrvoje Niksic
    - Re: A simple Lisp program -- algorithm and speed issues William D Clinger
Re: A simple Lisp program -- algorithm and speed issues Kelly Murray
Re: A simple Lisp program -- algorithm and speed issues Pierpaolo Bernardi
Re: A simple Lisp program -- algorithm and speed issues Bill Dubuque
Re: A simple Lisp program -- algorithm and speed issues Thomas A. Russ
Re: A simple Lisp program -- algorithm and speed issues Christopher J. Vogt
- Re: A simple Lisp program -- algorithm and speed issues Rainer Joswig
- Re: A simple Lisp program -- algorithm and speed issues Sam Pilato

From: Barry Margolin
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <5g2pfm$4qr@pasilla.bbnplanet.com>

In article <···············@jagor.srce.hr>,
Hrvoje Niksic  <·······@srce.hr> wrote:
>a) What is the "correct" algorithm to do it?  It seems to me that the
>   tree approach should be much faster than the assoc approach.  But
>   it isn't as much faster as I expected, probably because of not
>   using the primitives like `assoc'.  What method would an
>   experienced Lisp programmer use?

The choice between a binary tree and a hash table is essentially a
time-space tradeoff.  Hash tables are fast but they can waste memory (they
need to be sparse in order to reduce collisions).  Binary trees use minimal
memory, but they do lots more comparisons (you have to compare the current
item to each node along the path).  There are fancier structures like
B-trees and B*-trees that reduce the search overhead.  I suggest you pick
up a good book on data structures (perhaps start with Knuth's "Sorting and
Searching" book).

>b) Why is the C implementation that much faster than either of the
>   Lisp ones, even the hash-table approach?

That's a good question.

>c) Why is the Perl implementation as fast as the "best" Lisp one we
>   could think of, although Lisp should be compiled into native code,
>   while Perl is not.

Perl compiles into byte code, which is very easy to interpret, so the
interpreter overhead is minimal.  Most of the time, therefore, is spent in
the Perl runtime routines, which *are* compiled into native code.

A few years ago I rewrote a medium-sized C program in Perl.  I had to make
a few little tweaks to the algorithm, but the resulting program was as fast
as the original C program.

>d) Why does the Perl implementation seem to be the most elegant of
>   all?  Perl's supposed to be write-only, dammit! :-)

Perl was designed with applications like this in mind.  It has built-in
commands for parsing simple tables (the split() function), accumulating
associative data (the += operator borrowed from C, and the simple
associative array notation), and for scanning through standard input (the
while(<>) construct).  Once you get into the Perl mind-set, things like
this become quite intuitive.

One general comment: Lisp often seems to come up short when you use trivial
benchmarks.  The most blatant example is the ubiquitous "Hello, world"
program; a Lisp executable that does this will often contain the entire
Lisp runtime as well, even though most of it is unused.  Where Lisp excels
is not in simple programs, but in big applications.  That's where you can
take advantage of the large library of built-in data structures and
functions, the expressive power of Lisp's macros, and the development
environments that encourage exploratory programming and rapid prototyping.
The resulting programs might not necessarily be as fast as their C versions
(but then again, they might), but they're quite likely to be less buggy and
be completed much sooner (consider that a C version of your hash-table
example would require you to spend time implementing hashing, which is more
complex than the application-specific part of your program).
-- 
Barry Margolin
BBN Corporation, Cambridge, MA
······@bbnplanet.com
(BBN customers, call (800) 632-7638 option 1 for support)

From: Paul F. Dietz
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Wed, 12 Mar 1997 00:00:00 +0000
Message-ID: <5g5c24$7a9@nntp.interaccess.com>

Barry Margolin <······@bbnplanet.com> wrote:

>The choice between a binary tree and a hash table is essentially a
>time-space tradeoff.  Hash tables are fast but they can waste memory (they
>need to be sparse in order to reduce collisions).  Binary trees use minimal
>memory, but they do lots more comparisons (you have to compare the current
>item to each node along the path).

Why should hash tables use more memory than a binary tree?  The naive
implementation of binary trees has three pointers per item (one to the
item itself, two to the children, as well as some bits for balance
information).  A hash table of pointers to objects has (size of
table)/(number of objects) pointers per object (assuming collision
resolution by some method other than chaining.)  Even if the hash
table is given a density of 1/2 it still comes out ahead.

You could get the memory of the binary tree down to n + o(n),
using dense arrays for terminal subtrees of size O(log n), but I doubt
anyone really does that.

	Paul

From: Marco Antoniotti
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <s08209m5t74.fsf@crawdad.ICSI.Berkeley.EDU>

Can you post the Lisp code?
-- 
Marco Antoniotti - Resistente Umano
===============================================================================
	...e` la semplicita` che e` difficile a farsi.
				Bertholdt Brecht

From: Hrvoje Niksic
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <kigg1y28jlf.fsf@jagor.srce.hr>

Marco Antoniotti <·······@crawdad.icsi.berkeley.edu> writes:

> Can you post the Lisp code?

Well, I didn't want to make the article too big.  OK, the
tree and hash implementations follow.  I'd like to hear comments on
the coding style (especially for hash.lisp), and if there is a way to
make those functions faster for large input files.

tree.lisp:
----------
(defun tree-updated (root-node value)
  "Add new element to binary tree."
  (cond ((not root-node)
	 (cons value (cons nil nil)))
	((string-equal (car value) (caar root-node))
	 (setf (cadar root-node) (+ (cadar root-node) (cadr value)))
	 root-node)
	((string-lessp (car value) (caar root-node))
	 (setf (cadr root-node) (tree-updated (cadr root-node) value))
	 root-node)
	(t
	 (setf (cddr root-node) (tree-updated (cddr root-node) value))
	 root-node)))

(defun tree-inorder (root-node)
  "List tree starting with the leftmost node."
  (when root-node
    (tree-inorder (cadr root-node))
    (print (car root-node))
    (tree-inorder (cddr root-node))))

(defun tdp ()
  "Just for testing tree structures."
  (let (tree-root-node)
    (with-open-file (in "junk2" :direction :input)
      (let (n)
	(do ((line (read-line in nil) (read-line in nil)))
	    ((not line))
	  (setq n (position #\Space line)
		tree-root-node (tree-updated
                                 tree-root-node
				 (list (subseq line 0 n)
				       (parse-integer line :start n))))))
      (tree-inorder tree-root-node))))


hash.lisp:
----------
(defun hash-table-update (hash str value)
  "Add new element to hash table."
  (setf (gethash str hash) (+ value (gethash str hash 0))))

(defun hash-table-print (table)
  "Print hash table."
  (let (lst)
    (maphash #'(lambda (arg1 arg2)
		 (setq lst (cons arg1 lst)))
	     table)
    (mapc #'(lambda (arg)
	      (print arg)
	      (print (gethash arg table)))
	  (sort lst #'string-lessp)))
  t)

(defun hdp ()
  "Just for testing hash tables."
  (let ((hash-table (make-hash-table :test 'equal)))
    (with-open-file (in "junk2" :direction :input)
		    (let (n)
		      (do ((line (read-line in nil) (read-line in nil)))
			  ((not line))
			(setq n (position #\Space line))
			(hash-table-update hash-table (subseq line 0 n)
					   (parse-integer line :start n)))))
    (hash-table-print hash-table)))


-- 
Hrvoje Niksic <·······@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
* Q: What is an experienced Emacs user?
* A: A person who wishes that the terminal had pedals.

From: William D Clinger
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <33258F92.174D@ccs.neu.edu>

I am not surprised that the hash table was fastest.  It would
probably be fastest in C also.

> b) Why is the C implementation that much faster than either of the
>    Lisp ones, even the hash-table approach?

Is it possible that you are comparing unsafe C code (there isn't
any other kind) with safe Lisp code?  Is it possible that you are
comparing arithmetic mod 2^32 with unlimited precision integer
arithmetic?

The string operations are another thing to look out for.  For
example, the string comparison functions of Common Lisp take
keyword arguments, and it might take more time to process those
keyword arguments than to perform the actual string comparison;
if you're not using keyword arguments, then you are probably
copying the input string from a buffer into a separate string,
which has its own expense.  You might collect all the arguments
that are passed for each call to a string comparison function,
and then see how long the string comparisons take by themselves.

Lots of things could be going on, and it's hard to guess without
seeing the actual code.

Will

From: Hrvoje Niksic
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <kigendm8iav.fsf@jagor.srce.hr>

William D Clinger <····@ccs.neu.edu> writes:

> > b) Why is the C implementation that much faster than either of the
> >    Lisp ones, even the hash-table approach?
> 
> Is it possible that you are comparing unsafe C code (there isn't
> any other kind) with safe Lisp code?  Is it possible that you are
> comparing arithmetic mod 2^32 with unlimited precision integer
> arithmetic?
[...]
> Lots of things could be going on, and it's hard to guess without
> seeing the actual code.

All of these are quite possible, and I am very curious as to how to
avoid either.  I have posted the Lisp code to the newsgroup, so you
can inspect it and see what have I done wrong.  I no longer have the C
code, but it should be easy enough to write.

-- 
Hrvoje Niksic <·······@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
* Q: What is an experienced Emacs user?
* A: A person who wishes that the terminal had pedals.

From: William D Clinger
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <3325AFBD.49BC@ccs.neu.edu>

Thanks for posting the Common Lisp code.  I spent about 15 minutes
on this, and came up with the following timings using Sun (Lucid)
Common Lisp and Chez Scheme on a SPARC.

1.  original code with default optimization        4.0 seconds
2.  original code with optimizations enabled (*)   3.7
3.  using STRING= and STRING<                      3.5
4.  after defining TREE-UPDATED to return NIL      3.4
5.  the obvious translation of (3) into Scheme     1.2

(*) For (2) through (4) I incanted 
    (proclaim '(optimize (speed 3) (safety 0) (compilation-speed 0)))
                                              ^^^^^^^^^^^^^^^^^^^^^
                                              Only for Sun CL
These numbers show that:

Almost all of the time is being spent in TDP, not in TREE-UPDATED
or TREE-INORDER.

A fairly large chunk of this time must be due to inefficiencies
in the way that Sun Common Lisp implements things like READ-LINE,
POSITION, SUBSEQ, and PARSE-INTEGER.  In the Scheme version I
wrote READ-LINE and POSITION myself in Scheme (although I could
have pulled them out of SLIB), and used the standard SUBSTRING
and STRING->NUMBER operations.  The Common Lisp equivalents are
more general, but tend to be considerably slower for that very
reason.

If I really cared about making this code fast, I would:

    *  Use Scheme instead of Common Lisp.

    *  Read characters into a fixed buffer, and never allocate
       a string for a word that had already been seen.

    *  Rewrite TREE-UPDATED to transform the non-tail calls,
       most of which are actually unnecessary, into tail calls
       that are always necessary.

Will

--------
; Scheme code used to obtain timing (5).

(define (tree-updated root-node value)
  "Add new element to binary tree."
  (cond ((null? root-node)
         (cons value (cons '() '())))
        ((string=? (car value) (caar root-node))
         (set-car! (cdar root-node) (+ (cadar root-node) (cadr value)))
         root-node)
        ((string<? (car value) (caar root-node))
         (set-car! (cdr root-node) (tree-updated (cadr root-node) 
value))
         root-node)
        (else
         (set-cdr! (cdr root-node) (tree-updated (cddr root-node) 
value))
         root-node)))

(define (tree-inorder root-node)
  "List tree starting with the leftmost node."
  (if (not (null? root-node))
      (begin
    (tree-inorder (cadr root-node))
    (write (car root-node))
    (newline)
    (tree-inorder (cddr root-node)))))

(define (tdp)
  "Just for testing tree structures."
  (let ((tree-root-node '()))
    (call-with-input-file
     "junk2"
     (lambda (in)
       (let ((n 0))
         (do ((line (read-line in) (read-line in)))
             ((zero? (string-length line)))
             (set! n (position #\Space line))
             (set! tree-root-node (tree-updated
                                   tree-root-node
                                   (list (substring line 0 n)
                                         (string->number
                                          (substring line
                                                     (+ n 1)
                                                     (string-length 
line))))))))
       (tree-inorder tree-root-node)))))

(define (position c s)
  (let ((n (string-length s)))
    (define (loop i)
      (cond ((= i n)
             -1)
            ((char=? c (string-ref s i))
             i)
            (else
             (loop (+ i 1)))))
    (loop 0)))
    

(define (read-line in)
  (define (loop i)
    (let ((c (read-char in)))
      (cond ((eof-object? c)
             (substring line-buffer 0 i))
            ((char=? c #\newline)
             (substring line-buffer 0 i))
            (else
             (string-set! line-buffer i c)
             (loop (+ i 1))))))
  (loop 0))

(define line-buffer (make-string 1024 #\space))

From: Hrvoje Niksic
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <kigu3mikyvm.fsf@jagor.srce.hr>

William D Clinger <····@ccs.neu.edu> writes:

> Thanks for posting the Common Lisp code.  I spent about 15 minutes
> on this, and came up with the following timings using Sun (Lucid)
> Common Lisp and Chez Scheme on a SPARC.

Thanks for analyzing it.

> 3.  using STRING= and STRING<                      3.5
> 4.  after defining TREE-UPDATED to return NIL      3.4
[...]
> Almost all of the time is being spent in TDP, not in TREE-UPDATED
> or TREE-INORDER.

Yes, I have rechecked with CMU CL on a sparc, and you are right --
time spent in `tree-inorder' seems to be negligible (which is *very*
different from our previous results, which led us to a wrong track).

> 5.  the obvious translation of (3) into Scheme     1.2

I'm not sure I understand why Scheme would offer such a speed-up over
Lisp.  Better implementation of read-*?

> A fairly large chunk of this time must be due to inefficiencies
> in the way that Sun Common Lisp implements things like READ-LINE,
> POSITION, SUBSEQ, and PARSE-INTEGER.

This is quite possible.

> If I really cared about making this code fast, I would:
> 
>     *  Use Scheme instead of Common Lisp.

But I don't *want* to use Scheme.  I'd like to see whether Common Lisp
can be (nearly) as fast as C when doing this sort of thing.  Remember,
this is no FORTRAN-oid code -- Lisp should excel at such things,
shouldn't it?

>     *  Read characters into a fixed buffer, and never allocate
>        a string for a word that had already been seen.

Yes, this might be a good recommendation.  A good profiler would
probably bring miracles to that code.

>     *  Rewrite TREE-UPDATED to transform the non-tail calls,
>        most of which are actually unnecessary, into tail calls
>        that are always necessary.

What do you mean by this?  To remove the recursion?  Lisp compilers
claim to have removed the tail recursion in `tree-updated', but it's
not all that important, as the program spends little time there.

> ; Scheme code used to obtain timing (5).
[...]
>   (cond ((null? root-node)
>          (cons value (cons '() '())))
>         ((string=? (car value) (caar root-node))

Horrible. :-)  So similar to Lisp, and yet different!

-- 
Hrvoje Niksic <·······@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
WWW:          World-Wide-Waste.  Waste management corporation, which
              handles the billions of tons of garbage generated by just
              about everybody these days.

From: William D Clinger
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <3325BD41.2781@ccs.neu.edu>

Hrvoje Niksic, quoting me, wrote:
> >     *  Rewrite TREE-UPDATED to transform the non-tail calls,
> >        most of which are actually unnecessary, into tail calls
> >        that are always necessary.
> 
> What do you mean by this?

I didn't say this very well, so I'll try again.

Most of the side effects that are performed by TREE-UPDATED are
unnecessary.  The only necessary side effects occur at a leaf,
when it is transformed into a non-leaf.  Rewriting TREE-UPDATED
to perform these side effects only when necessary not only
reduces the number of side effects, but has the happy result
of transforming almost all of the non-tail-recursive calls into
tail-recursive calls, which should of course be much faster.

This would hardly affect the timing for the benchmark, at least
for Common Lisp, because the Common Lisp version of the benchmark
appears to be spending almost all of its time in READ-LINE, POSITION,
SUBSEQ, and perhaps in PARSE-INTEGER.  If these inefficiencies were
fixed, however, then the transformation I suggested might provide an
additional factor-of-two speedup, provided of course that your
implementation of Common Lisp is properly tail-recursive.

Quick-and-dirty Scheme code for the transformed version of
TREE-UPDATED is shown below.  The equivalent Common Lisp and C
code is left as an exercise for the reader.

I will now return to my regularly scheduled life, which is
already in progress.

Will

--------
(define (tree-updated root-node value)
  (define (loop parent setter root-node)
    (cond ((null? root-node)
           (setter parent (cons value (cons '() '()))))
          ((string=? (car value) (caar root-node))
           (set-car! (cdar root-node) (+ (cadar root-node) (cadr value))))
          ((string<? (car value) (caar root-node))
           (loop (cdr root-node) set-car! (cadr root-node)))
          (else
           (loop (cdr root-node) set-cdr! (cddr root-node)))))
  "Add new element to binary tree."
  (cond ((null? root-node)
         (cons value (cons '() '())))
        ((string=? (car value) (caar root-node))
         (set-car! (cdar root-node) (+ (cadar root-node) (cadr value)))
         root-node)
        ((string<? (car value) (caar root-node))
         (loop (cdr root-node) set-car! (cadr root-node))
         root-node)
        (else
         (loop (cdr root-node) set-cdr! (cddr root-node))
         root-node)))

From: Kelly Murray
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Tue, 11 Mar 1997 00:00:00 +0000
Message-ID: <5g4n18$2ec@sparky.franz.com>

In article <···············@jagor.srce.hr>, Hrvoje Niksic <·······@srce.hr> writes:
>> a file containing entries like this:
>> 
>> foo 5
>> bar 20
>> baz 4
>> foo 6

>> program that would sum the occurences of each word, and display them
>> sorted alphabetically, e.g.:
>> 
>> bar 20
>> baz 4
>> foo 11
>> 

My 10-minutes-of-thinking solution for doing this efficiently,
and not necessarily the simplest way, is to index off the first letter
to break up the data into smaller chunks that can be searched 
and sorted efficiently with alists.  This strategy would slow down
for very large inputs since the alists would get too large.

The next level of sophistication would be to keep a count
of the entries and then switch trategies once they got too large.
You could break the implementation up into "add-entry" functions
and use the same recursive strategy, except now
index off the next letter of the word for the next level down entries.
Then it starts to become more like an adaptive dynamic
b-tree-like solution.

-Kelly Murray    ···@franz.com    http://www.franz.com

(defun sort-words (in out)
 (loop
    with array = (make-array 256)
    for line = (read-line in nil)
    while line
    for space = (position #\space line)
    for word = (subseq line 0 space)
    for index = (char-int (char word 0))
    for count = (parse-integer line :start space)
    for entry = (assoc word (aref array index) :test #'string=)
    finally
      (loop 
         for index from 0 to 255
         for alist = (aref array index)
	do
         (when alist
	  (loop for entry in (sort alist #'car)
	      do
	      (format out "~%~A ~D" (car entry) (cdr entry)))))
     do
     (if entry
       (incf (cdr entry) count)
       (push (cons word count) (aref array index)))))

From: Pierpaolo Bernardi
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Thu, 13 Mar 1997 00:00:00 +0000
Message-ID: <5g8fdp$15sm@serra.unipi.it>

Hrvoje Niksic (·······@srce.hr) wrote:

[snip]

: There are a lot of lines in the file (~10000), but many of the words
: repeat (there are ~500 unique words).  We have endeavored to write a
: program that would sum the occurences of each word, and display them
: sorted alphabetically, e.g.:

[snip]

Why not the following:

read all of the data in a vector,
sort the vector on the strings,
do a pass to sum up the numbers.

Without testing, I think that this should be faster than your solutions.
Depending on the size of the actual problem, this may be a viable algorithm,
or not.

Pierpaolo.

From: Bill Dubuque
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Thu, 13 Mar 1997 00:00:00 +0000
Message-ID: <y8zlo7sm2wg.fsf@berne.ai.mit.edu>

Hrvoje Niksic <·······@srce.hr> writes:
>
> Here is an interesting, not entirely academic problem that me and a
> colleague are "wrestling" with.  Say there is a file, containing
> entries like this:
> 
> foo 5
> bar 20
> baz 4
> foo 6
> foobar 23
> foobar 3
> ...
> 
> There are a lot of lines in the file (~10000), but many of the words
> repeat (there are ~500 unique words).  We have endeavored to write a
> program that would sum the occurences of each word, and display them
> sorted alphabetically, e.g. ...

The following 10-minute hack runs in 2.47 seconds in ACL/Win on a 
486 66/2 PC, 2.31 seconds of which are spent reading the 10000 
line file. This will probably be the bottleneck in any CL, since 
file operations go through more levels of wrapping than in C.

The approach is simply to READ the data file, while in a temporary 
package, sum the values into the symbol-value slot, and finally collect
and sort all the symbols in the package. Most likely this will be
the fastest approach in most CL implementations because it exploits
the (presumably) efficient READ primitive, which will probably be 
more efficient than any equivalent parsing done from user code.
I'd be curious to see times for this code in other CL's.

ACL/Win can be downloaded from Franz's home page: http://www.franz.com/

-Bill Dubuque

(make-test-file "c:\\test.out")

;; Time the READ, EVAL, and SORT

(time (hrvoje-chomp "c:\\test.out"))

Execution time: 2.47 seconds plus 0.61 seconds garbage collecting

;; Time just the READ

(time (hrvoje-chomp "c:\\test.out" :eval nil))

Execution time: 2.31 seconds plus 0.55 seconds garbage collecting


(defun hrvoje-chomp (path &key (eval t) (sort t))
  (with-open-file (instream path :direction :input)
    (let* ((*package* (make-package (gensym) :use ()))
	   (stream (make-concatenated-stream
		     (make-string-input-stream "(")
		     instream
		     (make-string-input-stream ")")))
	   (data (read stream)))
      (when eval
	;; Sum each value into the symbol-value slot
	(do ((tail data (cddr tail)))
	    ((null tail))
	  (set (first tail)
	       (+ (if (boundp (first tail))
		      (symbol-value (first tail))
		      0)
		  (second tail))))
	(when sort
	  ;; Collect (symbol . value), sort lexicographically
	  (let (result)
	    (do-symbols (symbol *package*)
	      (push (cons (symbol-name  symbol)
			  (symbol-value symbol))
		    result))
	    (sort result #'(lambda (x y)
			     (string< (car x) (car y))))
	    (values)))))))

(defun make-test-file (path &optional (nwords 500) (nblocks 20))
  (let (strings)
    (dotimes (i nwords)
      (push (make-random-string 4)
	    strings))
    (with-open-file (outstream path :direction :output)
      (dotimes (i nblocks)
	(dolist (string strings)
	  (format outstream "~A ~D~%" string (random 100)))))))

(defun make-random-string (length &aux (string (make-string length)))
  (dotimes (i length  string)
    (setf (char string i) (code-char (+ (char-code #\A) (random 26))))))

From: Thomas A. Russ
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Thu, 13 Mar 1997 00:00:00 +0000
Message-ID: <ymi913rg76m.fsf_-_@hobbes.isi.edu>

In article <···············@jagor.srce.hr> Hrvoje Niksic <·······@srce.hr> writes:

 > From: Hrvoje Niksic <·······@srce.hr>
 > Newsgroups: comp.lang.lisp
 > Date: 11 Mar 1997 04:54:43 +0100
 > Here is an interesting, not entirely academic problem that me and a
 > colleague are "wrestling" with.  Say there is a file, containing
 > entries like this:
 > 
 > foo 5
 > bar 20
 > baz 4
 > foo 6
 > foobar 23
 > foobar 3
 > ...

 > 2) Lisp, using the binary tree approach.  The binary tree was
 >    implemented with each node of the tree having its CAR a list
 >    '("string" number), and CDR being a cons of left and right
        ^^^^^^^^

Of course, this might be a mistake right here, since for lisp you would
really want to have a symbol as the key rather than a string.  Symbol
comparison operations are much faster than string comparison.

What lisp code are you using to read in the data?
	
 >    pointers.  This took about 10s to complete (Allegro CL).
 > 
 > 3) Lisp, consing a simple associated list for each unique element, and
 >    updating the elements using (setf (assoc ...)).  The list is sorted
 >    in the end.  This took some 16s to complete (Allegro; CMU CL and
 >    GNU CL were much slower, about 32s), with the sorting part taking
 >    no perceivable amount of time.
 > 
 > 4) Lisp, using a hash-table, and sorting it in the end
 >    (alphabetically).  This took 5s to complete, sorting taking no
 >    perceivable amount of time (again).

I would suggest:

(defun read-sort-and-output (inputFilename outputFilename)
  ;; Generally you want a hash table's initial size to be about 2x the
  ;; number of expected entries.
  (let ((ht (make-hash-table :size 1000 :test #'eq))
	(result nil))
    (with-open-file (input inputFilename :direction :input)
      (loop for key = (read input nil :eof)
            as value = (read input nil :eof)
	    until (or (eql key :eof) (eql value :eof))
	    do (setf (gethash key ht) (+ (gethash key ht 0) value))))
    (maphash #'(lambda (k v) (push (list (symbol-name k) v) result)) ht)
    (with-open-file (output outputFileName :direction :output)  
      (format output "~{~{~A ~D~%~}~}" (sort result #'string< :key #'first)) )))

If you don't like obscure format code, then you could use an explicit
loop instead:

      (loop for item in result
	    do (format output "~A ~A~%" (first item) (second item)))

For faster processing, include declarations in the arithmetic code:

     do (setf (gethash key ht)
              (the fixnum (+ (the fixnum (gethash key ht 0))
	                     (the fixnum value))))

 > 
 > a) What is the "correct" algorithm to do it?  It seems to me that the
 >    tree approach should be much faster than the assoc approach.  But
 >    it isn't as much faster as I expected, probably because of not
 >    using the primitives like `assoc'.  What method would an
 >    experienced Lisp programmer use?

See above.

 > b) Why is the C implementation that much faster than either of the
 >    Lisp ones, even the hash-table approach?

Possibly because EQUAL hash tables (which is the type you need to use
for string keys) are slow compared to EQ or EQL hash tables.

 > c) Why is the Perl implementation as fast as the "best" Lisp one we
 >    could think of, although Lisp should be compiled into native code,
 >    while Perl is not.
 > 
 > d) Why does the Perl implementation seem to be the most elegant of
 >    all?  Perl's supposed to be write-only, dammit! :-)

Depends on your view.  I think the lisp version is pretty compact, even
if it doesn't have the nice line parsing approach.


 > 
 > -- 
 > Hrvoje Niksic <·······@srce.hr> | Student at FER Zagreb, Croatia
 > --------------------------------+--------------------------------
 > WWW:          World-Wide-Waste.  Waste management corporation, which
 >               handles the billions of tons of garbage generated by just
 >               about everybody these days.

 > 
-- 
Thomas A. Russ,  USC/Information Sciences Institute          ···@isi.edu

From: Christopher J. Vogt
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Wed, 19 Mar 1997 00:00:00 +0000
Message-ID: <vogt-1903971224340001@204.248.25.27>

In article <···············@jagor.srce.hr>, Hrvoje Niksic
<·······@srce.hr> wrote:
[...]

> My questions are:
> 
> a) What is the "correct" algorithm to do it?  It seems to me that the
>    tree approach should be much faster than the assoc approach.  But
>    it isn't as much faster as I expected, probably because of not
>    using the primitives like `assoc'.  What method would an
>    experienced Lisp programmer use?
> 
> b) Why is the C implementation that much faster than either of the
>    Lisp ones, even the hash-table approach?
> 
> c) Why is the Perl implementation as fast as the "best" Lisp one we
>    could think of, although Lisp should be compiled into native code,
>    while Perl is not.
> 
> d) Why does the Perl implementation seem to be the most elegant of
>    all?  Perl's supposed to be write-only, dammit! :-)

Sorry I'm so late in responding to this.  I have a Mac Quadra 700 running
MCL 2.1, so all numbers listed below pertain to this environment, YMMV.

The general answer to why lisp is so much slwer in your test code, is that
I/O in Lisp is notoriously *SLOW*.  I've worked for a company where we
used foreign function calling in C to handle all the I/O, that's how slow
Lisp I/O can be.

I ran your code and I found that 60% of the time is in printing and 25% is
in reading.  So, it is spending 85% of it's time doing what Lisp does
worse than just about any other language.  Note that not all I/O is
necessarily equal.  By printing to a file rather than to a window, I
doubled the overal performance!  Not that printing to a file is fast, but
it is much faster than printing to a listener.  I'll bet that if you got
rid of the reading and printing in both the C and the Lisp version (start
with the data already in a list for example and don't print) that the Lisp
version would be as fast as the C version.  Try it and let me know.

Of all the forms you can represent your data in, Lisp is least efficient
at dealing with strings, so try to avoid them.  Use symbols instead.  Now,
in your case, maybe you *REALLY* need the string, but if you could use
'foo instead of "foo" you'll find you get better performance almost
regardless of what it is you are doing.

Watch out for consing, this can be a much bigger time sink than many
people realize.  Just rewriting code to reduce consing can easily speed up
a routine by a factor of 2.

Below is the code I ended up with, which is about 3.5x faster than your
original code (which if you get a similar speed up, makes it as fast as
C).  My code isn't EQ your code, but it is EQUAL.

I hope this is of some help.

(proclaim '(inline 'my-hash-table-update)) ; small routine, in-line it for
speedup

(defun my-hash-table-update (hash str value)
  (setf (gethash str hash) (+ value (gethash str hash 0))))

(defun my-hash-table-print (table)
  (let (list)
    (with-open-file (out "snow white:desktop folder:out" :direction :output)
      (maphash #'(lambda (key value)
                   (push (cons key value) list))
               table)
      (setq list (sort list #'(lambda (cons1 cons2) 
                                (string-lessp (first cons1) (first cons2)))))
      (loop for i in list do 
            (print (first i) out) ; print to a file is faster than to a listener
            (print (cdr i) out))))
  t)

(defun my-hdp ()
  (let ((hash-table (make-hash-table :test 'eq))) ; eq is faster than
equal if it will work
    (with-open-file (in "snow white:desktop folder:data" :direction :input)
      (loop for key = (read in nil nil)    ; don't use strings unless you must
            for value = (read in nil nil)  ; don't read, and then parse
            while key do
            (my-hash-table-update hash-table key value)))
    (my-hash-table-print hash-table)))

-- 
····@novia.net
sl-Omaha, NE

From: Rainer Joswig
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Wed, 19 Mar 1997 00:00:00 +0000
Message-ID: <joswig-ya023180001903972115480001@news.lavielle.com>

In article <·····················@204.248.25.27>, ····@novia.net
(Christopher J. Vogt) wrote:

> Sorry I'm so late in responding to this.  I have a Mac Quadra 700 running
> MCL 2.1, so all numbers listed below pertain to this environment, YMMV.

MCL 4.0 may be a bit faster. ;-) Time to upgrade. ;-)

If you want to know what you can do with MCL nowadays see:
http://sk8.research.apple.com/ and prepare some free disk
space.

> I ran your code and I found that 60% of the time is in printing and 25% is
> in reading.  So, it is spending 85% of it's time doing what Lisp does
> worse than just about any other language.  Note that not all I/O is
> necessarily equal.  By printing to a file rather than to a window, I
> doubled the overal performance!  Not that printing to a file is fast, but
> it is much faster than printing to a listener.  I'll bet that if you got
> rid of the reading and printing in both the C and the Lisp version (start
> with the data already in a list for example and don't print) that the Lisp
> version would be as fast as the C version.  Try it and let me know.

The MCL listener is a full editor. There maybe other ways to get
a faster output to windows. Recent MCL versions have
recoded some of the editor stuff in Lisp for use on the PowerPC.
Scrolling is faster now.

See also:

- optimize declarations (speed, space, debug, safety, compilation-speed)
- type declarations (also in LOOP!)
- remove a level of dispatching in reading and printing
  see ccl::stream-reader

STREAM-READER stream 
[Generic Function]
returns two values, a function and a value. Applying the function to the
value is equivalent to applying stream-tyi to the stream, but is usually
much faster. Users can specialize stream-reader, but they need to be
sure that there are no stream-tyi methods specialized on a subclass of
the class on which the stream-reader method is specialized. The
maybe-default-stream-reader macro knows how to ensure that there
are no such stream-tyi methods.

- don't print to a scrolling window, it will use most
  of the time for scrolling

-- 
http://www.lavielle.com/~joswig/

From: Sam Pilato
Subject: Re: A simple Lisp program -- algorithm and speed issues
Date: Sun, 23 Mar 1997 00:00:00 +0000
Message-ID: <SamPi-2303970104260001@sampi.tiac.net>

In article <·····················@204.248.25.27>, ····@novia.net
(Christopher J. Vogt) wrote:

> In article <···············@jagor.srce.hr>, Hrvoje Niksic
> <·······@srce.hr> wrote:
> [...]

> I hope this is of some help.
> 
> (proclaim '(inline 'my-hash-table-update)) ; small routine, in-line it for
> speedup
> 
> (defun my-hash-table-update (hash str value)
>   (setf (gethash str hash) (+ value (gethash str hash 0))))
[...]


I would replace:
    (setf (gethash str hash) (+ value (gethash str hash 0)))
with:
    (incf (gethash str hash 0) value)

That's probably faster, and it's cleaner.