From: Bruce L. Lambert
Subject: How to optimize vector cosine computation?
Date: 
Message-ID: <4dh5js$45ao@tigger.cc.uic.edu>
Hi folks,

I've implemented a document clustering algorithm in LISP. The
rate-limiting step in this system is the computation of the pairwise
document similarity matrix. Documents are represented as vectors of
weighted term frequencies, and document similarity is measured by the
vector cosine (correlation).

Below is the code I use to compute the vector cosine. In my case, the
vectors are normally quite short because my "documents" are only one
sentence long. But this computation occurs in the innermost loop and must
be done millions of times. What declarations, proclamations, or other
fancy fixes would make these functions run fastest?
       
(defun inner-product (list1 list2)
  (let ((sum 0))
    (dotimes (i (length list1))
      (incf sum (* (elt list1 i) (elt list2 i))))
    sum))


(defun vector-length (list1)
  (sqrt (inner-product list1 list1)))


(defun vector-cosine (list1 list2)
  (float (/ (inner-product list1 list2)
            (* (vector-length list1)
               (vector-length list2)))))

Thanks.

-bruce

--replies by email appreciated.

Bruce L. Lambert, Ph.D.
University of Illinois at Chicago
········@uic.edu
·····@ludwig.pmad.uic.edu
+1 (312) 996-2411

From: ···@laphroig.mch.sni.de
Subject: Re: How to optimize vector cosine computation?
Date: 
Message-ID: <ARE.96Jan25100826@laphroig.mch.sni.de>
I think that, apart from the declarations, it would be much faster to
use arrays than lists (elt, for example, is a lot slower than aref or
svref). There is also no need to cdr along a list: with a vector you
can use a simple dotimes loop.
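
For concreteness, a minimal sketch of that suggestion (untested; it
assumes the weighted term frequencies are fixnums stored in simple
vectors, and that fixnum arithmetic does not overflow):

```lisp
;; Inner product over simple vectors of fixnums, per the suggestion
;; above: svref instead of elt, and a plain dotimes index loop.
(defun inner-product (v1 v2)
  (declare (type simple-vector v1 v2)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0))
    (declare (fixnum sum))
    (dotimes (i (length v1) sum)
      (incf sum (* (the fixnum (svref v1 i))
                   (the fixnum (svref v2 i)))))))
```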

Andreas
From: ···@franz.com
Subject: Re: How to optimize vector cosine computation?
Date: 
Message-ID: <4e6k43$242@sparky.franz.com>
In article <··········@camelot.ccs.neu.edu>, ·····@ccs.neu.edu (Michael Tselman) writes:
>> Bruce L. Lambert (········@uic.edu) wrote:
>> : rate-limiting step in this system is the computation of the pairwise
>> : document similarity matrix. Documents are represented as vectors of
>> : weighted term frequencies, and document similarity is measured by the
>> : vector cosine (correlation).
>> :
>> : (defun inner-product (list1 list2)
>> :  (let ((sum 0))
>> :   (dotimes (i (length list1))
>> :    (incf sum (* (elt list1 i) (elt list2 i))))
>> :   sum))
>> 
>> I would rewrite this as: 
>> 
>> (defun inner-product (list1 list2)
>>   (let ((sum 0))
>>     (declare (fixnum sum))
>>     (do* ((l1 list1 (cdr l1))
>> 	  (l2 list2 (cdr l2)))
>> 	((null l1))
>>       (incf sum (the fixnum (* (the fixnum (car l1)) (the fixnum (car l2))))))
>>     sum))
>> 
>> With the above declarations you get a 10 times speed up (on my system - Sparc

Doing (elt list i) on a list is highly inefficient, so I'd expect an even
larger speedup if that were actually the case.  However, if the call to
(vector-length list1) below is correct, then list1 is actually a
vector and not a list.  It can in fact be faster to represent them
as lists and cdr down them than to do array references on a vector,
though the locality of the lists becomes critical (and they will take
up more space as well); still, the above is a very good optimization
suggestion.  It also depends on the Lisp compiler's optimizations, and
on how long the lists are.  It may also be much faster and smaller if
you can fit the terms into 8 bits and use a vector of (unsigned-byte 8)
or (signed-byte 7), provided the compiler compiles the (aref vector i)
calls efficiently.
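
A sketch of that specialized-vector variant, assuming the weights fit
in (unsigned-byte 8) and that your compiler open-codes aref for that
element type:

```lisp
;; Inner product over specialized 8-bit vectors.  The element-type
;; declaration is what lets a good compiler compile the aref calls
;; into direct byte accesses.
(defun inner-product-u8 (v1 v2)
  (declare (type (simple-array (unsigned-byte 8) (*)) v1 v2)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0))
    (declare (fixnum sum))
    (dotimes (i (length v1) sum)
      (incf sum (* (aref v1 i) (aref v2 i))))))
```

Such vectors are built with make-array, e.g.
(make-array 3 :element-type '(unsigned-byte 8) :initial-contents '(1 2 3)).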

>> (defun vector-cosine (list1 list2)
>>  (float (/ (inner-product list1 list2)
>> 	    (* (vector-length list1)
>> 	       (vector-length list2)))))

Another optimization is to avoid the ratio computation that comes
from the division: either use truncate instead of dividing, or make
the numbers floats before the division.  Actually, you only need to
make ONE of them a float; the other will be converted to a float
internally, which saves consing one float (hopefully the compiler
will not cons any).
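
For example, floating only the numerator before the division (a
sketch, restated with the poster's original helper functions):

```lisp
;; Unoptimized helpers, as in the original post but over vectors.
(defun inner-product (v1 v2)
  (let ((sum 0))
    (dotimes (i (length v1) sum)
      (incf sum (* (aref v1 i) (aref v2 i))))))

(defun vector-length (v)
  (sqrt (inner-product v v)))

;; Floating only the numerator means the division happens in float
;; arithmetic, so no rational intermediate is built.
(defun vector-cosine (v1 v2)
  (/ (float (inner-product v1 v2))
     (* (vector-length v1) (vector-length v2))))
```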

Another suggestion is to avoid the computation altogether by caching
the results, but in this case that doesn't seem likely to help, as
matches are probably checked from a new query against the ones in the
database, and the query is likely always different.

Hope that helps,
 -Kelly Murray   ···@franz.com  http://www.franz.com