Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector))

From: ·············@googlemail.com
Subject: Lisps memory usage (was: Efficient way to create strings? (was: How 	to Write a Spelling Corrector))
Date: Mon, 27 Oct 2008 19:59:34 +0000
Message-ID: <b5088ca6-b158-43ce-8a22-63e17bf1fb2b@s50g2000hsb.googlegroups.com>

> There was a discussion sometime ago regarding a lisp version of
> Norvig's spell checking program.
> http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/78...
> However, the main concern at the time was to minimize line count.
> I was curious and wrote my version that optimizes for readability.
> However, when I tried it out I noticed that it is _very_ slow. Norvig
> claimed that the first test case (270 words) took 13 seconds.
> Here's my result (P4 2.8Ghz SBCL 1.0.3 linux)
> SPELLCHECK> (time (spelltest *tests1* :verbose t))
> ...
> User time    =  0:04:57.094
> System time  =        0.840
> Elapsed time =  0:05:14.309
> Allocation   = 1581329808 bytes
> 0 Page faults
> That is simply unacceptable. So I tried a few examples in the above
> thread and the results were either much worst or _slightly_ better
> (within 10% range).

Same results here. My program is located at http://www.profibing.de/lisp/spell/,
Lisp is ClozureCL 1.3(trunk) 64 bit, Mac OSX.

I tried to minimize consing with a macro DOEDITS.

(defun edits1 (word)
  "Create a list of all words with Levenshtein distance of one."
  (let (words)
    (doedits (e1 word)
      (pushnew (copy-seq e1) words :test #'string=))
    words))

(defun edits2 (word)
  "Create a list of all words with Levenshtein distance of two."
   (let (words)
     (doedits (e1 word)
       (doedits (e2 e1)
         (pushnew (copy-seq e2) words :test #'string=)))
       words))

But (length (edits2 "something")) needs 630 seconds, while the Python
version does it at a blink.

The benchmark runs with comparable time but it uses a *lot* of memory.

Can we do better? How comes it, that Python is so fast, even as a
interpreted language?

--
Bernd

Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Rainer Joswig
- Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Marek Kubica
Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Scott Burson
Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Raffael Cavallaro
- Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) ·············@googlemail.com
  - Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Scott Burson
  - Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) Thomas A. Russ
    - Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) ·············@googlemail.com
      - Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) ·············@googlemail.com
        Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) ·············@googlemail.com
        Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector)) ·············@googlemail.com

From: Rainer Joswig
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector))
Date: Mon, 27 Oct 2008 19:33:14 +0000
Message-ID: <joswig-811FF5.21331427102008@news-europe.giganews.com>

In article 
<····································@s50g2000hsb.googlegroups.com>,
 ·············@googlemail.com wrote:

> > There was a discussion sometime ago regarding a lisp version of
> > Norvig's spell checking program.
> > http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/78...
> > However, the main concern at the time was to minimize line count.
> > I was curious and wrote my version that optimizes for readability.
> > However, when I tried it out I noticed that it is _very_ slow. Norvig
> > claimed that the first test case (270 words) took 13 seconds.
> > Here's my result (P4 2.8Ghz SBCL 1.0.3 linux)
> > SPELLCHECK> (time (spelltest *tests1* :verbose t))
> > ...
> > User time    =  0:04:57.094
> > System time  =        0.840
> > Elapsed time =  0:05:14.309
> > Allocation   = 1581329808 bytes
> > 0 Page faults
> > That is simply unacceptable. So I tried a few examples in the above
> > thread and the results were either much worst or _slightly_ better
> > (within 10% range).
> 
> Same results here. My program is located at http://www.profibing.de/lisp/spell/,
> Lisp is ClozureCL 1.3(trunk) 64 bit, Mac OSX.
> 
> I tried to minimize consing with a macro DOEDITS.
> 
> (defun edits1 (word)
>   "Create a list of all words with Levenshtein distance of one."
>   (let (words)
>     (doedits (e1 word)
>       (pushnew (copy-seq e1) words :test #'string=))
>     words))
> 
> (defun edits2 (word)
>   "Create a list of all words with Levenshtein distance of two."
>    (let (words)
>      (doedits (e1 word)
>        (doedits (e2 e1)
>          (pushnew (copy-seq e2) words :test #'string=)))
>        words))
> 
> But (length (edits2 "something")) needs 630 seconds, while the Python
> version does it at a blink.

The functions above look very inefficient.

You are pushing something onto a list. For every object
you push you have potentially to traverse the whole list.
The time complexity is bad - which you will see for long lists.
Then you use copy-seq. Is that needed?

A typical better implementation is to use a hash-table, put
the strings into the hash-table and later get the
contents of the hash-table as the result. Checking if
a string is in the hash-table is likely faster.

The other code in your spell.lisp file also has lots
of opportunities to be optimized. You also don't
need EVAL-WHEN around the macro. The compiler
notices that it is a macro and allows
the use of the macro in the same file.

To find out where your code spends its time, try to use a profiler.
That's a very useful tool.

> The benchmark runs with comparable time but it uses a *lot* of memory.
> 
> Can we do better? How comes it, that Python is so fast, even as a
> interpreted language?

There are no 'interpreted languages'. (Most) Languages can be implemented
by a compiler or interpreter or both (-> C interpreters, Lisp compilers, ...).

> 
> --
> Bernd

-- 
http://lispm.dyndns.org/

From: Marek Kubica
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: How to Write a Spelling Corrector))
Date: Mon, 27 Oct 2008 22:49:08 +0000
Message-ID: <ge5gh4$a4t$1@hoshi.visyn.net>

On Mon, 27 Oct 2008 21:33:14 +0200, Rainer Joswig wrote:

>> Can we do better? How comes it, that Python is so fast, even as a
>> interpreted language?
> 
> There are no 'interpreted languages'. (Most) Languages can be
> implemented by a compiler or interpreter or both (-> C interpreters,
> Lisp compilers, ...).

Not only that - Python usually gets compiled into Python VM bytecode or 
Java bytecode or .NET bytecode that gets interpreted. OTOH there are also 
Lisp interpreters which include a Just-in-time compiler. The distinction 
gets blurry.

regards,
Marek

From: Scott Burson
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Wed, 29 Oct 2008 05:35:24 +0000
Message-ID: <1e790a9c-4ccb-40e5-8bf8-77e83d25a9c0@n1g2000prb.googlegroups.com>

On Oct 27, 12:59 pm, ·············@googlemail.com wrote:
>
> But (length (edits2 "something")) needs 630 seconds, while the Python
> version does it at a blink.

Are you sure you compiled the code?  I just tried it in SBCL and it
took .167 sec.

-- Scott

From: Raffael Cavallaro
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: How  to Write a Spelling Corrector))
Date: Thu, 30 Oct 2008 00:29:45 +0000
Message-ID: <geav5p$kir$1@aioe.org>

On 2008-10-27 15:59:34 -0400, ·············@googlemail.com said:

> But (length (edits2 "something")) needs 630 seconds, while the Python
> version does it at a blink.

No way - I'm running CCL 64-bit as well:

? (lisp-implementation-version)
"Version 1.2-r11241M  (DarwinX8664)"
? (time (length (edits2 "something")))
(LENGTH (EDITS2 "something")) took 359,333 microseconds (0.359333 
seconds) to run
                    with 2 available CPU cores.
During that period, 316,098 microseconds (0.316098 seconds) were spent 
in user mode
                    16,398 microseconds (0.016398 seconds) were spent 
in system mode
60,495 microseconds (0.060495 seconds) was spent in GC.
 26,268,256 bytes of memory allocated.
274675
?

From: ·············@googlemail.com
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Thu, 30 Oct 2008 19:50:58 +0000
Message-ID: <52baf906-d221-4678-8b95-0c92b0baf331@d1g2000hsg.googlegroups.com>

On 30 Okt., 01:29, Raffael Cavallaro <················@pas-d'espam-
s'il-vous-plait-mac.com> wrote:
> On 2008-10-27 15:59:34 -0400, ·············@googlemail.com said:
>
> > But (length (edits2 "something")) needs 630 seconds, while the Python
> > version does it at a blink.
>
> No way - I'm running CCL 64-bit as well:
>
> ? (lisp-implementation-version)
> "Version 1.2-r11241M  (DarwinX8664)"
> ? (time (length (edits2 "something")))
> (LENGTH (EDITS2 "something")) took 359,333 microseconds (0.359333
> seconds) to run
>                     with 2 available CPU cores.
> During that period, 316,098 microseconds (0.316098 seconds) were spent
> in user mode
>                     16,398 microseconds (0.016398 seconds) were spent
> in system mode
> 60,495 microseconds (0.060495 seconds) was spent in GC.
>  26,268,256 bytes of memory allocated.
> 274675
> ?
 Meanwhile I changed the code. Instead of PUSHNEW just PUSH is used.
Because of duplicates this leads to the answer 274675 instead of
114324 like in Python.

CL-USER> (defvar *aaa*)
*AAA*
CL-USER> (time (progn (setf *aaa* (edits2 "something")) (values)))
(PROGN (SETF *AAA* (EDITS2 "something")) (VALUES)) took 309,687
microseconds (0.309687 seconds) to run
                    with 2 available CPU
cores.
During that period, 295,427 microseconds (0.295427 seconds) were spent
in user mode
                    14,485 microseconds (0.014485 seconds) were spent
in system mode
57,727 microseconds (0.057727 seconds) was spent in
GC.
 26,268,256 bytes of memory
allocated.
; No value
CL-USER> (length *aaa*)
274675
CL-USER> (time (length (delete-duplicates *aaa* :test #'string=)))
(LENGTH (DELETE-DUPLICATES *AAA* :TEST #'STRING=)) took 1,610,182,581
microseconds (1610.182500 seconds) to run
                    with 2 available CPU
cores.
During that period, 1,570,058,823 microseconds (1570.058800 seconds)
were spent in user mode
                    3,492,789 microseconds (3.492789 seconds) were
spent in system mode
 16 bytes of memory
allocated.
 434 minor page faults, 434 major page faults, 0
swaps.
114324




It seems that emulating Pythons set() with PUSHNEW or DELETE-
DUPLICATES is a bad idea.

--
Bernd

From: Scott Burson
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Thu, 30 Oct 2008 22:38:22 +0000
Message-ID: <47f4355a-4505-43c4-82a9-57f1cdedb38e@1g2000prd.googlegroups.com>

On Oct 30, 12:50 pm, ·············@googlemail.com wrote:
>  Meanwhile I changed the code. Instead of PUSHNEW just PUSH is used.

Oh!  I didn't notice.

I changed it to use FSet and got a time of 1.2 sec:

(defun edits2 (word)
  "Create a list of all words with Levenshtein distance of two."
  (let ((words (set)))
    (doedits (e1 word)
      (doedits (e2 e1)
        (adjoinf words (copy-seq e2))))
    (convert 'list words)))

FSET-USER> (time (size (edits2 "something")))
;;; Real time: 1.2 sec
114324

http://common-lisp.net/project/fset/

(I'll have the FSet 1.2 release, complete with brand new tutorial, out
this weekend for sure.)

-- Scott

From: Thomas A. Russ
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was:  How to Write a Spelling Corrector))
Date: Thu, 30 Oct 2008 22:42:57 +0000
Message-ID: <ymid4hhd7ri.fsf@blackcat.isi.edu>

·············@googlemail.com writes:

> It seems that emulating Pythons set() with PUSHNEW or DELETE-
> DUPLICATES is a bad idea.

Only if the set is large.

The only implementation available for serial PUSHNEW forms is a N^2
algorithm.  The typical implementations for DELETE/REMOVE-DUPLICATES are
also N^2 algorithms.

Fortunately, it isn't hard to write a faster function for
REMOVE-DUPLICATES in the case of large inputs.  That is actually fairly
common.

A better approach would be to use a different implementation technique
for sets than plain lists.  Certainly Java does that with their use of
HashSets.  I'm not sure what Python does.

You could either create your own set implementation class or else IIRC
there is something in one of the CL libraries, but I'm not sure exactly
which one has it.

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: ·············@googlemail.com
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Sat, 01 Nov 2008 08:52:07 +0000
Message-ID: <3b0b09cc-cfe7-46c2-b769-7d26f1dce07c@i20g2000prf.googlegroups.com>

Some experiments with hash-tables:

(defun edits1-list (word)
  "Create a list of all words with Levenshtein distance of one."
  (let (words)
    (doedits (e1 word)
      (push (copy-seq e1) words))
    (delete-duplicates words :test #'string=)))


(defun edits1-hash (word)
  "Create a hash-table of all words with Levenshtein distance of one."
  (let ((words (make-hash-table :test #'equal :size (+ 25 (* 54
(length word))))))
    (doedits (e1 word)
      (setf (gethash e1 words) t))
    words))

CL-USER> (length (edits1-list "something"))
494
CL-USER> (hash-table-count (edits1-hash "something"))
482

Why is there a difference?

From: ·············@googlemail.com
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Sat, 01 Nov 2008 09:10:44 +0000
Message-ID: <a7b381e0-3ed8-4e7b-a544-744a9936448b@x16g2000prn.googlegroups.com>

And it is not dependend on Lisp implementation:

(hash-table-count (edits1-hash "something"))
;; AllegroCL:
482
;; ClozureCL:
254
;; SBCL:
254

All with MacOSX/PPC 32-bit

From: ·············@googlemail.com
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Sat, 01 Nov 2008 09:13:27 +0000
Message-ID: <6d4db231-4845-48df-9f9c-022892f705c4@f40g2000pri.googlegroups.com>

On 1 Nov., 10:10, ·············@googlemail.com wrote:
> And it is not dependend on Lisp implementation:
>
> (hash-table-count (edits1-hash "something"))
> ;; AllegroCL: 482
> ;; ClozureCL:254
> ;; SBCL:254

s/not dependend/dependent/

From: ·············@googlemail.com
Subject: Re: Lisps memory usage (was: Efficient way to create strings? (was: 	How to Write a Spelling Corrector))
Date: Sat, 01 Nov 2008 09:29:41 +0000
Message-ID: <fb5ed207-9855-431d-92a4-e0dafddcc837@n1g2000prb.googlegroups.com>

Found it. The addresses of MAKE-STRING where reused.

(defun edits1-hash (word)
  "Create a hash-table of all words with Levenshtein distance of one."
  (let ((words (make-hash-table :test #'equal :size (+ 25 (* 54
(length word))))))
    (doedits (e1 word)
      (setf (gethash (copy-seq e1) words) t))
    words))

CL-USER> (hash-table-count (edits1-hash "something"))
494