From: T Taneli Vahakangas
Subject: Optimizing n-gram generation
Date: 
Message-ID: <slrnetloec.76c.vahakang@kruuna.helsinki.fi>
Hello all,

I'm trying to make a lisp program for language and charset
identification using an n-gram based mechanism. The central
piece of code is the n-gram function:

(defun n-gram (data &optional (maxn 3))
  (let ((n-grams (make-hash-table :test 'equalp :size 100000)))
    (loop for i from 0 to (- (array-total-size data) maxn)
          do (loop for n from 1 to maxn
                   do (incf (gethash (subseq data i (+ i n)) n-grams 0))))
    n-grams))

It is very slow. If I feed it with 2 megabytes of '(unsigned-byte 8)
data, it munches for 45 to 140 seconds depending on whether I run it
on the (quite beefy) AMD X2 or laptop. Feeding strings is faster, but
still too slow. I'm using SBCL 1.0.1.

Are my expectations of speed off the mark, and is this about the
fastest I will get? (I'd hate to implement this in C or Python just
to see if it is faster.) Why is '(unsigned-byte 8) about 2-3 times
slower than string?

The implementation may be buggy and ugly. I'm a newbie, please tell
me how to improve it. Also, the algorithm chosen for the task of
identification may be suboptimal. In that case I'd like to be
educated, too.

TIA,

	Taneli

From: Vassil Nikolov
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <yy8vmz35y41k.fsf@eskimo.com>
On 20 Feb 2007 11:55:56 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
| ...
| (defun n-gram (data &optional (maxn 3))
|   (let ((n-grams (make-hash-table :test 'equalp :size 100000)))
|     (loop for i from 0 to (- (array-total-size data) maxn)
|           do (loop for n from 1 to maxn
|                    do (incf (gethash (subseq data i (+ i n)) n-grams 0))))
|     n-grams))
|
| It is very slow. If I feed it with 2 megabytes of '(unsigned-byte 8)
| data, it munches for 45 to 140 seconds depending on whether I run it
| on the (quite beefy) AMD X2 or laptop. Feeding strings is faster, but
| still too slow. I'm using SBCL 1.0.1.

  Some general questions first:  Compiled?  With declarations (e.g. for
  DATA, as a simple (?) array of (UNSIGNED-BYTE 8))?  What does

    (time (n-gram ...))

  print, including the memory allocation information?

  ---Vassil.

-- 
Our programs do not have bugs; it is just that the users' expectations
differ from the way they are implemented.
From: ··@codeartist.org
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <1172231813.689841.90100@j27g2000cwj.googlegroups.com>
On 23 Feb., 03:57, Vassil Nikolov <···············@pobox.com> wrote:
> On 20 Feb 2007 11:55:56 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
> | ...
> | (defun n-gram (data &optional (maxn 3))
> |   (let ((n-grams (make-hash-table :test 'equalp :size 100000)))
> |     (loop for i from 0 to (- (array-total-size data) maxn)
> |           do (loop for n from 1 to maxn
> |                    do (incf (gethash (subseq data i (+ i n)) n-grams 0))))
> |     n-grams))
> |
> | It is very slow. If I feed it with 2 megabytes of '(unsigned-byte 8)
> | data, it munches for 45 to 140 seconds depending on whether I run it
> | on the (quite beefy) AMD X2 or laptop. Feeding strings is faster, but
> | still too slow. I'm using SBCL 1.0.1.

You could try not to use SUBSEQ in the inner loop. What's the typical
range of maxn? If it is 1-3 (or 1-7 on a 64-bit machine), I would
access the bytes using AREF and construct a fixnum using DPB:

(declaim (inline 3gram))
(defun 3gram (data i)
  (let ((a (aref data i))
        (b (aref data (+ i 1)))
        (c (aref data (+ i 2))))
    (dpb a (byte 8 0)
         (dpb b (byte 8 8)
              (dpb c (byte 8 16) 0)))))

If it is longer, I would compute a hash value from the individual
bytes and use that as the key in a Lisp hash table.
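
For small maxn, the packed fixnums can go straight into an EQL hash
table. A sketch of how that might look (assuming an (unsigned-byte 8)
vector and the 3GRAM function above; N-GRAM-PACKED is a made-up name):

```lisp
;; Count 3-grams using packed fixnum keys instead of SUBSEQ keys.
;; Fixnums hash under EQL, so no fresh arrays are consed per key.
(defun n-gram-packed (data)
  (let ((counts (make-hash-table :test 'eql :size 100000)))
    (loop for i from 0 to (- (length data) 3)
          do (incf (gethash (3gram data i) counts 0)))
    counts))
```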

ciao,
Jochen
From: ··@codeartist.org
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <1172233536.547591.309140@t69g2000cwt.googlegroups.com>
On 23 Feb., 12:56, ·····@codeartist.org" <····@codeartist.org> wrote:
> (declaim (inline 3gram))
> (defun 3gram (data i)
>   (let ((a (aref data i))
>         (b (aref data (+ i 1)))
>         (c (aref data (+ i 2))))
>     (dpb a (byte 8 0)
>          (dpb b (byte 8 8)
>               (dpb c (byte 8 16) 0)))))
>

Or for the more general case:

(defmacro ngram (n data)
  (if (= n 1)
    `(aref data i)
    `(dpb (aref ,data ,n) (byte 8 ,(* (1- n) 8)) (ngram ,(1- n) ,data))))

The macro could look at MOST-POSITIVE-FIXNUM to decide whether the
packed value will box or not, and then generate either the above or
hashing code. But this is left as an exercise ;-)

ciao,
Jochen
From: ··@codeartist.org
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <1172234206.526031.9330@t69g2000cwt.googlegroups.com>
On 23 Feb., 13:25, ·····@codeartist.org" <····@codeartist.org> wrote:
> On 23 Feb., 12:56, ·····@codeartist.org" <····@codeartist.org> wrote:
>
> > (declaim (inline 3gram))
> > (defun 3gram (data i)
> >   (let ((a (aref data i))
> >         (b (aref data (+ i 1)))
> >         (c (aref data (+ i 2))))
> >     (dpb a (byte 8 0)
> >          (dpb b (byte 8 8)
> >               (dpb c (byte 8 16) 0)))))
>
> Or for the more general case:
>
> (defmacro ngram (n data)
>   (if (= n 1)
>     `(aref data i)
>     `(dpb (aref ,data ,n) (byte 8 ,(* (1- n) 8)) (ngram ,(1- n) ,data))))
>

Duh! Think before posting! The index should of course be a parameter
too:

(defmacro ngram (n data i)
  (if (= n 1)
    `(aref data i)
    `(dpb (aref ,data (+ i ,n)) (byte 8 ,(* (1- n) 8))
          (ngram ,(1- n) ,data))))

And if you want to be safe to not evaluate data or i multiple times:

(defmacro ngram (n data i)
  (rebinding (data i)
    (unless (typep n 'integer)
      (error "n has to be an integer for NGRAM"))
    (if (= n 1)
      `(aref data i)
      `(dpb (aref ,data (+ i ,n)) (byte 8 ,(* (1- n) 8))
            (ngram ,(1- n) ,data)))))

For REBINDING look for

http://groups.google.de/group/comp.lang.lisp/browse_thread/thread/e8738d5378efe9dc/fc60bfa60b99cdd0?lnk=st&q=rebinding+common+lisp&rnum=4&hl=de#fc60bfa60b99cdd0

ciao,
Jochen
From: T Taneli Vahakangas
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <slrnetu0mk.mgq.vahakang@kruuna.helsinki.fi>
In article <······················@t69g2000cwt.googlegroups.com>,
··@codeartist.org wrote:
> 
> Duh! Think before posting! The index should of course be a parameter
> too:
> 
> (defmacro ngram (n data i)
>   (if (= n 1)
>     `(aref data i)
>     `(dpb (aref ,data (+ i ,n)) (byte 8 ,(* (1- n) 8))
>           (ngram ,(1- n) ,data))))
> 
> And if you want to be safe to not evaluate data or i multiple times:
> 
> (defmacro ngram (n data i)
>   (rebinding (data i)
>     (unless (typep n 'integer)
>       (error "n has to be an integer for NGRAM"))
>     (if (= n 1)
>       `(aref data i)
>       `(dpb (aref ,data (+ i ,n)) (byte 8 ,(* (1- n) 8))
>             (ngram ,(1- n) ,data)))))
> 
> For REBINDING look for
> 
> http://groups.google.de/group/comp.lang.lisp/browse_thread/thread/e8738d5378efe9dc/fc60bfa60b99cdd0?lnk=st&q=rebinding+common+lisp&rnum=4&hl=de#fc60bfa60b99cdd0

Thanks! I already have a version that uses AREF and PUSH, which is
about five times faster than the SUBSEQ one. I will try your version,
but that has to wait until next week. Besides REBINDING, I didn't know
about DPB (and I am quite a novice with macros).

For maxn, I don't know yet. I've been using 3 and 4, but now that
n-gram performs better, I'll see what kind of benefits larger values
might offer.

	Taneli
From: T Taneli Vahakangas
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <slrneu969s.6su.vahakang@kruuna.helsinki.fi>
In article <························@t69g2000cwt.googlegroups.com>,
··@codeartist.org wrote:
> On 23 Feb., 12:56, ·····@codeartist.org" <····@codeartist.org> wrote:
>> (declaim (inline 3gram))
>> (defun 3gram (data i)
>>   (let ((a (aref data i))
>>         (b (aref data (+ i 1)))
>>         (c (aref data (+ i 2))))
>>     (dpb a (byte 8 0)
>>          (dpb b (byte 8 8)
>>               (dpb c (byte 8 16) 0)))))
>>

Ok, this is great. It performs in ~8 seconds instead of ~21 seconds.
It also conses 220 times less.

> 
> Or for the more general case:
> 
> (defmacro ngram (n data)
>   (if (= n 1)
>     `(aref data i)
>     `(dpb (aref ,data ,n) (byte 8 ,(* (1- n) 8)) (ngram ,(1- n) ,data))))

But this one didn't work (or rather the bugfixed version). Even after
I added a comma here and a parameter there, it still didn't work.

Especially the REBINDING and WITH-UNIQUE-NAMES stuff was way over my
head -- I just blindly copied it over and watched it explode in
technicolor in one version and with red/blue 3-d goggles in the other.

> Which could look into most-positive-fixnum to decide if this will box
> or not and then generate either the above or hashing code. But this is
> left as an exercise ;-)

Sounds like something I will have to look into in the future. For now,
the performance is definitely sufficient. I just have to fix the macro
so that I don't have 1GRAM, 2GRAM, 3GRAM etc.
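
One possible shape for such a macro, for what it's worth (a sketch, not
the version from the thread; it packs bytes with ASH/LOGIOR rather than
DPB, and uses plain gensyms instead of REBINDING):

```lisp
(defmacro ngram (n data i)
  "Pack N consecutive bytes of DATA, starting at index I, into one
integer, first byte in the low bits. N must be a literal integer."
  (check-type n (integer 1))
  (let ((d (gensym "DATA")) (j (gensym "I")))
    `(let ((,d ,data) (,j ,i))
       (logior ,@(loop for k from 0 below n
                       collect `(ash (aref ,d (+ ,j ,k)) ,(* 8 k)))))))
```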

	Taneli
From: T Taneli Vahakangas
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <slrnettgoj.69h.vahakang@kruuna.helsinki.fi>
In article <················@eskimo.com>, Vassil Nikolov wrote:
> 
> On 20 Feb 2007 11:55:56 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
> | ...
> | (defun n-gram (data &optional (maxn 3))
> |   (let ((n-grams (make-hash-table :test 'equalp :size 100000)))
> |     (loop for i from 0 to (- (array-total-size data) maxn)
> |           do (loop for n from 1 to maxn
> |                    do (incf (gethash (subseq data i (+ i n)) n-grams 0))))
> |     n-grams))
> |
> | It is very slow. If I feed it with 2 megabytes of '(unsigned-byte 8)
> | data, it munches for 45 to 140 seconds depending on whether I run it
> | on the (quite beefy) AMD X2 or laptop. Feeding strings is faster, but
> | still too slow. I'm using SBCL 1.0.1.
> 
>   Some general questions first:  Compiled?  With declarations (e.g. for
>   DATA, as a simple (?) array of (UNSIGNED-BYTE 8))?  What does

I use slime and press C-c, which says it compiles, is that what you
mean? I know nothing about declarations ...

>     (time (n-gram ...))
> 
>   print, including the memory allocation information?

I'm currently using about a 4-megabyte training corpus, but here goes:

CL-USER> (time (defparameter *h1* (n-gram-slow *s1* 4)))
Evaluation took:
  111.304 seconds of real time
  111.16695 seconds of user run time
  0.136008 seconds of system run time
  [Run times include 0.4 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  761,305,472 bytes consed.
*H1*

With aref it is about 5 times faster:

CL-USER> (time (defparameter *h1* (n-gram *s1* 4)))
Evaluation took:
  23.598 seconds of real time
  23.38546 seconds of user run time
  0.212014 seconds of system run time
  [Run times include 0.544 seconds GC run time.]
  0 calls to %EVAL
  0 page faults and
  887,527,424 bytes consed.
*H1*

There isn't much information about memory allocations.

	Taneli
From: Vassil Nikolov
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <yy8v8xeo15we.fsf@eskimo.com>
On 23 Feb 2007 10:33:55 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
| I use slime and press C-c, which says it compiles, is that what you
| mean? I know nothing about declarations ...

  You can see what (if any) difference in performance it makes with

    (defun n-gram (data &optional (maxn 3))
      (declare (type (simple-array ...) data))
      ;;                           ^^^ filling this in is left as an exercise...
      (declare (fixnum maxn))
      (let ...))

  and possibly adding an OPTIMIZE declaration as well.  (I believe SBCL may
  be able to figure out that I and N are fixnums on its own.)  It might also
  help if you wrap the GETHASH call in a (THE FIXNUM ...) form.  Of course,
  using Common Lisp declarations is pretty well covered in the literature.
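
  Filled in for the OP's data, that might read as follows (a sketch;
  N-GRAM-DECLARED is a made-up name, and the element type must match
  how the vector was actually created):

```lisp
(defun n-gram-declared (data &optional (maxn 3))
  (declare (type (simple-array (unsigned-byte 8) (*)) data)
           (fixnum maxn)
           (optimize speed))
  (let ((n-grams (make-hash-table :test 'equalp :size 100000)))
    (loop for i from 0 to (- (length data) maxn)
          do (loop for n from 1 to maxn
                   do (incf (gethash (subseq data i (+ i n)) n-grams 0))))
    n-grams))
```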

| ...

| CL-USER> (time (defparameter *h1* (n-gram-slow *s1* 4)))
| ...
|   [Run times include 0.4 seconds GC run time.]
| ...
|   0 page faults and
|   761,305,472 bytes consed.
| *H1*

| With aref it is about 5 times faster:

| CL-USER> (time (defparameter *h1* (n-gram *s1* 4)))
| ...
|   [Run times include 0.544 seconds GC run time.]
| ...
|   0 page faults and
|   887,527,424 bytes consed.
| *H1*

| There isn't much information about memory allocations.

  Oh, but there is (I have left it above).  Even if you miss the fact
  that SUBSEQ is called a _lot_ of times, TIME will tell you that the
  above code conses a lot.  (As a rule of thumb, in a tight inner
  loop, you want "0 bytes consed"...)

  It has already been pointed out that if MAXN is small, you are much
  better off with a FIXNUM representing the n-gram, rather than a
  subsequence.  For larger values of MAXN, it still might be better
  (you will have to experiment), or making a displaced array for the
  hash key might be better (again, you'd have to experiment).  Of
  course, using a custom hash table would be best, but needs more
  work.

  Another thing: do you really need to hash on EQUALP?  If yes, is
  that because you want to ignore letter case differences?  If so,
  it would be much better to canonicalize the letter case once,
  before running the loop, rather than every time a hash key is
  obtained.
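
  If the input were a string and case were the only reason for EQUALP,
  the canonicalization could look like this (a sketch; COUNT-NGRAMS is
  a made-up name):

```lisp
(defun count-ngrams (text maxn)
  (let ((data (string-downcase text))           ; canonicalize case once
        (table (make-hash-table :test 'equal))) ; EQUAL suffices for strings
    (loop for i from 0 to (- (length data) maxn)
          do (loop for n from 1 to maxn
                   do (incf (gethash (subseq data i (+ i n)) table 0))))
    table))
```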

  ---Vassil.


-- 
Our programs do not have bugs; it is just that the users' expectations
differ from the way they are implemented.
From: T Taneli Vahakangas
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <slrneu97io.6su.vahakang@kruuna.helsinki.fi>
In article <················@eskimo.com>, Vassil Nikolov wrote:
> 
> On 23 Feb 2007 10:33:55 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
> | I use slime and press C-c, which says it compiles, is that what you
> | mean? I know nothing about declarations ...
> 
>   You can see what (if any) difference in performance it makes with
> 
>     (defun n-gram (data &optional (maxn 3))
>       (declare (type (simple-array ...) data))
>       ;;                           ^^^ filling this in is left as an exercise...
>       (declare (fixnum maxn))
>       (let ...))
> 
>   and possibly adding an OPTIMIZE declaration as well.  (I believe SBCL may
>   be able to figure out that I and N are fixnums on its own.)  It might also
>   help if you wrap the GETHASH call in a (THE FIXNUM ...) form.  Of course,
>   using Common Lisp declarations is pretty well covered in the literature.

Nice, that helps. Thank you for pointing out DECLARE, I just didn't
know about it. Declaring types for data and maxn sped it up by
5-6 percent, and wrapping GETHASH gave about 0.2 percent, if I
remember correctly (or whatever, it was statistically insignificant).

>   above code conses a lot.  (As a rule of thumb, in a tight inner
>   loop, you want "0 bytes consed"...)

Ok, understood. Now I just have to figure out which constructs
do consing and which do not ... is this something people look
up from a reference book or just learn by trial?

So far I've read the first 90 pages or so of Graham's "On Lisp" and
then peeked into the CLHS very often.

>   It has already been pointed out that if MAXN is small, you are much
>   better off with a FIXNUM representing the n-gram, rather than a
>   subsequence.  For larger values of MAXN, it still might be better
>   (you will have to experiment), or making a displaced array for the
>   hash key might be better (again, you'd have to experiment).  Of
>   course, using a custom hash table would be best, but needs more
>   work.

I was a little afraid that I'd need to make a custom hash-table.
I am happy with the current performance, so that is not an issue.
At least not yet.

>   Another thing: do you really need to hash on EQUALP?  If yes, is
>   that because you want to ignore letter case differences?  If so,
>   it would be much better to canonicalize the letter case once,
>   before running the loop, rather than every time a hash key is
>   obtained.

Actually, at this point I can't tell which codes are characters, let
alone their case. The reason for EQUALP was that when using SUBSEQ, I
had arrays as hash keys, and EQUAL just didn't cut it:
> (EQUAL #(1 2 3) #(1 2 3))
NIL
> (EQUALP #(1 2 3) #(1 2 3))
T

I may have misunderstood a very basic detail here ...

	Taneli
From: Vassil Nikolov
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <yy8vbqjfm2u9.fsf@eskimo.com>
On 27 Feb 2007 21:10:48 GMT, ········@cc.helsinki.fi (T Taneli Vahakangas) said:
| ...
| Declaring types for data and maxn sped it up by
| 5-6 percent, and wrapping GETHASH gave about 0.2 percent, if I
| remember correctly (or whatever, it was statistically insignificant).

  I see...  I suppose it was still worth a try.  Was that together
  with an OPTIMIZE declaration, by the way, e.g. if you are sure all
  values will be valid for their respective operations, you might
  say

    (declare (optimize speed (safety 0)))  ;this code needs to burn...

| Ok, understood. Now I just have to figure out which constructs
| do consing and which do not ... is this something people look
| up from a reference book or just learn by trial?

  The former.  Look for words such as "new" or "fresh", e.g. CLHS
  says "subseq always allocates a new sequence for a result".
  Textbooks as a rule point these things out, too, at least for
  frequently occurring cases such as APPEND.

  The canonical example is, of course, CONS itself, which always
  returns a fresh cons (needless to say).

  Of course, if you encounter an undocumented library, you might
  have to resort to trial or examining the source if available,
  but that is another matter...

|| Another thing: do you really need to hash on EQUALP?  If yes, is
|| that because you want to ignore letter case differences?  If so,
|| it would be much better to canonicalize the letter case once,
|| before running the loop, rather than every time a hash key is
|| obtained.

| Actually, at this point I can't tell which codes are characters and
| definitely not their case. The reason for EQUALP was that when using
| SUBSEQ, I had arrays as hash keys and EQUAL just didn't cut it:
|| (EQUAL #(1 2 3) #(1 2 3))
| NIL
|| (EQUALP #(1 2 3) #(1 2 3))
| T

  Of course.  That was me not paying attention...

  ---Vassil.


-- 
mind mate, n.
  One of two persons mentally compatible with each other (cf. soul mate).
From: John Thingstad
Subject: Re: Optimizing n-gram generation
Date: 
Message-ID: <op.togvs4nbpqzri1@pandora.upc.no>
On Wed, 28 Feb 2007 03:29:50 +0100, Vassil Nikolov  
<···············@pobox.com> wrote:

> | Ok, understood. Now I just have to figure out which constructs
> | do consing and which do not ... is this something people look
> | up from a reference book or just learn by trial?
>
>   The former.  Look for words such as "new" or "fresh", e.g. CLHS
>   says "subseq always allocates a new sequence for a result".
>   Textbooks as a rule point these things out, too, at least for
>   frequently occurring cases such as APPEND.
>
>   The canonical example is, of course, CONS itself, which always
>   returns a fresh cons (needless to say).
>
>   Of course, if you encounter an undocumented library, you might
>   have to resort to trial or examining the source if available,
>   but that is another matter...
>


And the latter. For instance, I had some code like

(if (= (/ i p) (floor (/ i p))) ...)

The problem with this code is that / conses the fractions on the heap.

Use (rem i p) instead.

For my code the time went from 36 seconds to 6 seconds because of this  
alone.

(trying all positive integers less than 1,000,000 for primality)

The way I saw it was from the output of TIME.

So the moral is: look at the number of allocations reported by TIME as
well. If it seems high, isolate the problem, then correct it (if
possible).
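
As a tiny predicate, the fix looks like this (DIVISIBLE-P is a made-up
name):

```lisp
(defun divisible-p (i p)
  ;; (/ i p) conses a ratio whenever P does not divide I;
  ;; REM stays in integer arithmetic and allocates nothing.
  (zerop (rem i p)))
```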

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/