From: Joel Reymont
Subject: Vector processing with Lisp
Date: 
Message-ID: <1122744739.528446.219440@g43g2000cwa.googlegroups.com>
Inspired by the likes of K and with an eye of processing multiple
densely packed arrays that form the base of MonetDB I decided to apply
Lisp to vector processing.

Papers like http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf talk
at length about optimizing cache usage and processing multiple vectors
in the same function. I would love to have optimized Lisp primitives
(such as aggregate) to process a list of arrays and return a list of
results. I thought I would start small, though.

I quickly wrote up the following simple functions and thought I would
look at the disassemble dump. I'm on a PowerBook G4 with LispWorks and
results don't seem to be impressive. What do you make of the results
and how would you suggest tackling vector processing with Lisp in an
efficient manner? Just a note, I do want to process multiple vectors at
a time as the theory says this is more efficient.

    Thanks, Joel

P.S.

;; 67 assembler instructions

(defun sum1 (list acc)
  (if (null list)
      acc
      (sum1 (cdr list) (+ acc (car list)))
      ))

;; 53 assembler instructions

(defun sum2 (list)
  (let ((acc 0))
    (dolist (l1 list acc)
      (incf acc l1)
      )))

;; 60  assembler instructions

(defun sum3 (list)
  (let ((acc 0))
    (loop
       for l1 in list
	 do
	 (incf acc l1))
    acc))

;; 86  assembler instructions

(defun sum4 (vec)
  (let ((acc 0))
    (loop for x across vec
	 do
	 (incf acc x))
    acc))

From: Joel Reymont
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <1122749077.082680.209470@g47g2000cwa.googlegroups.com>
I tried a few optimizations and here are the new results. The number of
assembler instructions is in the comment pre/post optimization.

(eval-when (compile)
  (hcl:toggle-source-debugging nil))

(eval-when (compile load eval)
  (defvar *optimize* '(optimize (safety 0)
		       (space 0)
		       (debug 0)
		       (float 0)
		       (hcl:fixnum-safety 0)
		       (speed 3))))

;; 67

(defun sum1 (list acc)
  (if (null list)
      acc
      (sum1 (cdr list) (+ acc (car list)))
      ))

;; 53/52

(defun sum2 (list)
  (let ((acc 0))
    (dolist (l1 list acc)
      (incf acc l1)
      )))

;; 60/53

(defun sum3 (list)
  (let ((acc 0))
    (loop
       for l1 in list
	 do
	 (incf acc l1))
    acc))

;; 86/72

(defun sum4 (vec)
  (let ((acc 0))
    (loop for x across vec
	 do
	 (incf acc x))
    acc))

;; 47

(defun sum6 (vec)
  (declare #.*optimize*
	   (type simple-array vec))
  (let ((acc 0))
    (loop for x fixnum across vec
       do
	 (incf acc x))
    acc))
From: Joel Reymont
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <1122749505.649925.156990@g47g2000cwa.googlegroups.com>
The example posted by Wade in this thread
http://groups-beta.google.com/group/comp.lang.lisp/browse_thread/thread/f3e7534dc2f2ce1f/091a7847e078dbca?q=vector+processing&rnum=2#091a7847e078dbca
, i.e.

(defun f (arr1 arr2)
              (loop for x across arr1
                    for c across arr2
                    sum (* x c)))

makes me think that I should probably generate primitives of different
arity (using macros?), like the "mul" above.

I'm still curios why the number of assembler instructions generated by
Lisp is so high. It does seem high given the problem of iterating
through an array and summing up the elements.

    Thanks, Joel
From: ·······@gmail.com
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <1122754668.766974.108490@g44g2000cwa.googlegroups.com>
Joel Reymont wrote:
> Inspired by the likes of K and with an eye of processing multiple
> densely packed arrays that form the base of MonetDB I decided to apply
> Lisp to vector processing.
>
> Papers like http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf talk
> at length about optimizing cache usage and processing multiple vectors
> in the same function. I would love to have optimized Lisp primitives
> (such as aggregate) to process a list of arrays and return a list of
> results. I thought I would start small, though.
>
> I quickly wrote up the following simple functions and thought I would
> look at the disassemble dump. I'm on a PowerBook G4 with LispWorks and
> results don't seem to be impressive. What do you make of the results
> and how would you suggest tackling vector processing with Lisp in an
> efficient manner? Just a note, I do want to process multiple vectors at
> a time as the theory says this is more efficient.

1. Be careful not to conflate an implementation and a language. SBCL
outputs very sane code for my examples (just prologue + your average
tight loop + epilogue, sometimes with error handling).

2. Why are you timing performance with fixnums anyway? I'd expect
you'll want to use vectors of floats.

3. As the following shall demonstrate, you should be using vectors (aka
arrays) if you want vectors. (Timings are sbcl 0.9.2 on coLinux 2.6
with a P-M 1.6 GHz)

(defpackage :vector-test
    (:use :common-lisp))
(in-package :vector-test)

(defun random-list (n &optional acc)
  (let ((acc nil))
    (dotimes (i n acc)
      (push (floor (* (random most-positive-fixnum)
		      (1- (/ (random 2) 2))
		      (/ 1000)))
	    acc))))

(declaim (optimize (speed 3) (safety 0) (debug 0)))

(declaim (notinline sum-vector-1 sum-vector-2
		    sum-vector-3-base sum-vector-3
		    sum-vector-4-base sum-vector-4))

(defun sum-vector-1 (lst)
  (declare (type list lst)
	   (inline reduce))
  (reduce #'+ lst))

(defun sum-vector-2 (vector)
  (declare (type (simple-array fixnum) vector)
	   (inline reduce))
  (reduce #'+ vector))

(defun sum-vector-3-base (lst)
  (declare (type list lst))
  (labels ((sum-one (lst acc)
	     (declare (type fixnum acc))
	     (let ((rest (cdr lst))
		   (acc (+ acc (the fixnum (first lst)))))
	       (if rest
		   (sum-one rest acc)
		   acc))))
    (sum-one (cdr lst) (first lst))))

(defun sum-vector-3 (lst)
  "Dies if the list's length isn't a multiple of 2/is null (fixable,
but I'm lazy)"
  (declare (type list lst))
  (labels ((sum-two (lst acc1 acc2) ;;args or local doesn't change
anything, it'll be inlined anyway.
	     (declare (type fixnum acc1 acc2))
	     (let ((x1 (first lst))
		   (x2 (second lst))
		   (rest (cddr lst)))
	       (declare (type fixnum x1 x2))
	       (let ((acc1 (+ x1 acc1))
		     (acc2 (+ x2 acc2)))
		 (if rest
		     (sum-two rest acc1 acc2)
		     (+ acc1 acc2))))))
    (sum-two (cddr lst)
	     (the fixnum (first lst))
	     (the fixnum (second lst)))))

(defun sum-vector-4-base (vector)
  (declare (type (simple-array fixnum) vector))
  (let ((acc 0))
    (declare (type fixnum acc))
    (dotimes (i (length vector) acc)
      (setf acc (+ (the fixnum acc) (aref vector i))))))

(defun sum-vector-4 (vector)
  "Same thing re length"
  (declare (type (simple-array fixnum) vector))
  (let ((acc1 0)
	(acc2 0))
    (declare (type fixnum acc1 acc2))
    (dotimes (i (floor (length vector) 2)
	      (+ (the fixnum acc1)
		 (the fixnum acc2)))
      (setf acc1 (+ (the fixnum acc1) (aref vector (* 2 i)))
	    acc2 (+ (the fixnum acc2) (aref vector (1+ (* 2 i))))))))

(declaim (inline sum-vector-1 sum-vector-2
		 sum-vector-3-base sum-vector-3
		 sum-vector-4-base sum-vector-4))

(defun test (n &optional (times 10))
  (declare (type fixnum n times)
	   (inline sum-vector-1 sum-vector-2
		   sum-vector-3-base sum-vector-3
		   sum-vector-4-base sum-vector-4))
  (let* ((lst (random-list n))
	 (vect (make-array n :element-type 'fixnum :initial-contents lst)))
      (format t "~%sum-vector-1~%")
      (time (dotimes (i times)
	      (sum-vector-1 lst)))
      (format t "~%sum-vector-2~%")
      (time (dotimes (i times)
	      (sum-vector-2 vect)))
    (format t "~%sum-vector-3-base~%")
    (time (dotimes (i times)
	    (declare (type fixnum i))
	    (sum-vector-3-base lst)))
    (format t "~%sum-vector-3~%")
    (time (dotimes (i times)
	    (declare (type fixnum i))
	      (sum-vector-3 lst)))
    (format t "~%sum-vector-4-base~%")
    (time (dotimes (i times)
	    (declare (type fixnum i))
	    (sum-vector-4-base vect)))
    (format t "~%sum-vector-4~%")
    (time (dotimes (i times)
	    (declare (type fixnum i))
	    (sum-vector-4 vect)))))

sum-vector-1 REDUCEs over a list, while -2 does the same over a
simple-array.

sum-vector-3-base traverses a list 1 by 1, while sum-vector-3 does so 2
by 2 (the first version was with double-floats; it doesn't make a lot
of sense with fixnums).

sum-vector-4-base traverses a vector 1 by 1, while sum-vector-4 foes so
2 by 2.

First run:

(test (expt 2 20) 10) ;vectors/lists of 2^20 fixnums, sum 10 times.

sum-vector-1
Evaluation took:
  7.494 seconds of real time
  4.71 seconds of user run time
  2.6 seconds of system run time
  14 page faults and
  167,772,048 bytes consed.

sum-vector-2
Evaluation took:
  7.907 seconds of real time
  5.19 seconds of user run time
  2.59 seconds of system run time
  1 page fault and
  167,768,024 bytes consed.

sum-vector-3-base
Evaluation took:
  0.091 seconds of real time
  0.09 seconds of user run time
  0.01 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-3
Evaluation took:
  0.083 seconds of real time
  0.06 seconds of user run time
  0.02 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4-base
Evaluation took:
  0.058 seconds of real time
  0.05 seconds of user run time
  0.01 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4
Evaluation took:
  0.081 seconds of real time
  0.07 seconds of user run time
  0.01 seconds of system run time
  0 page faults and
  0 bytes consed.

REDUCEing is a couple orders of magnitude slower. Same benchmark,
without sum-vector-1/2:

(test (expt 2 20) 1000)
sum-vector-3-base
Evaluation took:
  11.343 seconds of real time
  10.03 seconds of user run time
  1.31 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-3
Evaluation took:
  11.163 seconds of real time
  9.92 seconds of user run time
  1.23 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4-base
Evaluation took:
  4.256 seconds of real time
  3.87 seconds of user run time
  0.39 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4
Evaluation took:
  8.961 seconds of real time
  7.99 seconds of user run time
  0.97 seconds of system run time
  0 page faults and
  0 bytes consed.

Vectors are much faster, and summing in parallel doesn't seem to be
worth it.

Just for completeness, here are the results for the same previous run,
but with double-floats instead:
sum-vector-3-base
Evaluation took:
  21.582 seconds of real time
  19.36 seconds of user run time
  2.19 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-3
Evaluation took:
  23.721 seconds of real time
  20.68 seconds of user run time
  2.6 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4-base
Evaluation took:
  8.488 seconds of real time
  7.4 seconds of user run time
  0.96 seconds of system run time
  0 page faults and
  0 bytes consed.

sum-vector-4
Evaluation took:
  9.437 seconds of real time
  8.25 seconds of user run time
  1.07 seconds of system run time
  0 page faults and
  0 bytes consed.

The difference in timing is much smaller, but the basic version is
still faster. It'd probably pay off to process elements in parallel if
you fused loops together (less time spent waiting for memory, and more
on waiting for the operations to finish), or if I used a dumber
processor.

Asm size:
You shouldn't look too much at the prologue, and certainly not at the
error-trapping code. If you'll be processing any non-trivial amount of
data, the time spent inside the main loops will titanically dwarf that
spent outside them. In any case, benchmarking is probably much more
telling, except (maybe) if you think you'll be hitting the I-cache's
limit.

sum-vector-1/2 are meaningless for that, since they simply call REDUCE
sum-vector-3-base: 15 insns (main loop, as for all the other counts)
sum-vector-3: 19 insns
sum-vector-4-base: 5 insns
sum-vector-4: 10 insns

Conclusions:
1. Loops that traverses list must have the code to do so. That means a
lot more memory loads, so larger and slower loop. It also means that,
you can easily get unlucky and have your data strewn around everywhere
in memory, killing locality.
2. CL has real vectors. Don't fake them with lists.

Paul Khuong
From: Joel Reymont
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <1122766200.477625.194810@g49g2000cwa.googlegroups.com>
Paul,

Thank you for your pointers and the code. I wasn't trying to measure
speed but the number of instructions generated by LispWorks for each
version of the code. Adding optimizations to sum1, sum2 and sum3 (the
list versions) reduced the number of assembler instructions to 15, 10
and 12 respectively.

The winner with 10 instructions is

(eval-when (compile load eval)
  (defvar *optimize* '(optimize (safety 0)
		       (space 0)
		       (debug 0)
		       (float 0)
		       (hcl:fixnum-safety 0)
		       (speed 3))))

(defun sum2 (list)
  (declare #.*optimize*)
  (let ((acc 0))
    (dolist (l1 list acc)
      (incf acc l1)
      )))

CL-USER> (disassemble #'sum2)
    #x0: #x60760000     ori r22,res/arg0,#x0
    #x4: #x38600000     li res/arg0,#x0
    #x8: #x7C18B000     cmp cr0,nil,r22
    #xC: #x41820014     beq #x20
   #x10: #x82B60003     lwz r21,#x3(r22)
   #x14: #x7C751814     addc res/arg0,r21,res/arg0
   #x18: #x82D60007     lwz r22,#x7(r22)
   #x1C: #x4BFFFFEC     b #x8
   #x20: #x38000001     li nargs,#x1
   #x24: #x4E800020     blr
10

Now that things seem to be normal again I can get down to the serious
stuff, i.e. vectors of floats and doubles :-).
From: Edi Weitz
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <u4qabvnub.fsf@agharta.de>
On 30 Jul 2005 16:30:00 -0700, "Joel Reymont" <······@gmail.com> wrote:

> I wasn't trying to measure speed but the number of instructions
> generated by LispWorks for each version of the code.
>
> [...]
>
>   (defvar *optimize* '(optimize (safety 0)
> 		       (space 0)
> 		       (debug 0)
> 		       (float 0)
> 		       (hcl:fixnum-safety 0)
> 		       (speed 3)))

I don't know if it makes a difference in this case but if you, as you
say, are interested in code size and not in speed why do you tell your
compiler to optimize for speed and not for code size?  SPEED is for
"speed of the object code" and SPACE is for "both code size and
run-time space."

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Joel Reymont
Subject: Re: Vector processing with Lisp
Date: 
Message-ID: <1122793797.431063.118930@g49g2000cwa.googlegroups.com>
Edi,

I tried that and it did not make a difference anywhere except sum6 wich
came out 3 instructions slimmer.

(defun sum6 (vec)
  (declare #.*optimize*
	   (type (simple-array fixnum (*)) vec))
  (let ((acc 0))
    (loop for x fixnum across vec
       do
	 (incf acc x))
    acc))