I another thread: http://makeashorterlink.com/?N69B61DCC I noted that
doing tree recursion in CPS can be expensive.
This was my example:
CL-USER> (defun make-zero-tree (depth)
(if (zerop depth)
nil
(let ((tree (make-zero-tree (- depth 1))))
(cons 0 (cons tree tree)))))
MAKE-ZERO-TREE
CL-USER> (defun num-node (tree)
(if (null tree)
0
(+ 1 (num-node (cadr tree)) (num-node (cddr tree)))))
NUM-NODE
CL-USER> (defun num-node-cps (tree &optional (k #'identity))
(if (null tree)
(funcall k 0)
(num-node-cps (cadr tree)
#'(lambda (n)
(num-node-cps (cddr tree)
#'(lambda (m)
(funcall k (+ 1 m n))))))))
NUM-NODE-CPS
CL-USER> (let ((tree (make-zero-tree 28)))
(time (num-node tree)))
Evaluation took:
12.308 seconds of real time
12.304769 seconds of user run time
0.0 seconds of system run time
0 page faults and
8,192 bytes consed.
268435455
CL-USER> (let ((tree (make-zero-tree 28)))
(time (num-node-cps tree)))
Evaluation took:
82.417 seconds of real time
77.42084 seconds of user run time
0.352022 seconds of system run time
0 page faults and
12,884,939,992 bytes consed.
268435455
CL-USER>
I'm waiting to hear if there is a way to make the CPS code less
expensive. In the meantime, I tried to see how much of a speedup I could
get by turning the function into special purpose stack machine. This is
what I came up with:
(defun num-node-smachine (tree)
(let* ((stack (make-array '(1000)))
(sp (progn (setf (svref stack 0) tree) 1))
(tree nil)
(dump (make-array '(1000)))
(dp 0)
(state 'call)
(m 0)
(n 0)
(rv 0)
(ret-inst 'halt))
(declare (type fixnum m n rv sp dp)
(optimize (speed 3)))
(loop (ecase state
(call (setf (svref dump dp) ret-inst
dp (1+ dp)
(svref dump dp) tree
dp (1+ dp)
sp (1- sp)
(svref dump dp) sp
dp (1+ dp)
tree (svref stack sp)
sp (1+ sp)
state 'branch))
(halt (return (aref stack (1- sp))))
(branch (if (null tree)
(setf state 'push0)
(setf state 'push-lst)))
(push0 (setf (svref stack sp) 0
sp (1+ sp)
state 'rtn))
(push-lst (setf (svref stack sp) (cadr tree)
sp (1+ sp)
ret-inst 'push-rst
state 'call))
(push-rst (setf (svref stack sp) (cddr tree)
sp (1+ sp)
ret-inst 'add1
state 'call))
(add1 (setf sp (1- sp)
m (svref stack sp)
sp (1- sp)
n (svref stack sp)
(svref stack sp) (+ 1 m n)
sp (1+ sp)
state 'rtn))
(rtn (setf sp (1- sp)
rv (svref stack sp)
dp (1- dp)
sp (svref dump dp)
(svref stack sp) rv
sp (1+ sp)
dp (1- dp)
tree (svref dump dp)
dp (1- dp)
state (svref dump dp)))))))
Yikes!
CL-USER> (let ((tree (make-zero-tree 28)))
(time (num-node-smachine tree)))
Evaluation took:
36.516 seconds of real time
36.50628 seconds of user run time
0.004 seconds of system run time
0 page faults and
40,768 bytes consed.
268435455
CL-USER>
An improvement on the CPS code, but still a lot worse than the
original function. This is just a toy, but how do you get good
performance in lisp using low level code like num-node-smachine?
Matt
--
"You do not really understand something unless you can
explain it to your grandmother." — Albert Einstein.
Matthew D Swank wrote:
> I another thread: http://makeashorterlink.com/?N69B61DCC I noted that
> doing tree recursion in CPS can be expensive.
>
> CL-USER> (let ((tree (make-zero-tree 28)))
> (time (num-node tree)))
> Evaluation took:
> 12.308 seconds of real time
> 12.304769 seconds of user run time
> 0.0 seconds of system run time
> 0 page faults and
> 8,192 bytes consed.
> 268435455
> CL-USER> (let ((tree (make-zero-tree 28)))
> (time (num-node-cps tree)))
> Evaluation took:
> 82.417 seconds of real time
> 77.42084 seconds of user run time
> 0.352022 seconds of system run time
> 0 page faults and
> 12,884,939,992 bytes consed.
> 268435455
That's `only' a factor of 6 slower. Given that you increased consing
by a factor of a million, that's not bad at all.
Using the stack is going to be more efficient than using the heap. See
@techreport{ miller94garbage,
author = "James S. Miller and Guillermo J. Rozas",
title = "Garbage Collection is Fast, but a Stack is Faster",
number = "AIM-1462",
pages = "37",
year = "1994",
url = "citeseer.ist.psu.edu/miller94garbage.html" }
>
> I'm waiting to hear if there is a way to make the CPS code less
> expensive.
I can't think of anything obvious. You may want to check that your GC
is tuned for this large amount of consing. You should make your
youngest generation maybe 3/4 the size of physical memory if your
implementation allows it.
> In the meantime, I tried to see how much of a speedup I could
> get by turning the function into special purpose stack machine. This is
> what I came up with:
[code snipped]
I think most of the time here is wasted on the ECASE dispatch. This is
going to turn into a big COND expression that checks the state against
each of the tags in turn.
You can inline the majority of your state transitions, so I'd start
with that. You can use TAGBODY and GO to control the flow. They
should turn into jump instructions.
On Thu, 16 Mar 2006 09:32:06 -0800, Joe Marshall wrote:
> I think most of the time here is wasted on the ECASE dispatch. This is
> going to turn into a big COND expression that checks the state against
> each of the tags in turn.
>
> You can inline the majority of your state transitions, so I'd start
> with that. You can use TAGBODY and GO to control the flow. They
> should turn into jump instructions.
Here's the new version:
(defun num-node-smachine (tree)
(let* ((stack (make-array '(1000)))
(sp (progn (setf (svref stack 0) tree) 1))
(tree nil)
(dump (make-array '(1000)))
(dp 0)
(m 0)
(n 0)
(rv 0)
(ret-label 'halt)
(jmp nil))
(declare (type fixnum m n rv sp dp)
(optimize (speed 3)))
(tagbody
call
(setf (svref dump dp) ret-label
dp (1+ dp)
(svref dump dp) tree
dp (1+ dp)
sp (1- sp)
(svref dump dp) sp
dp (1+ dp)
tree (svref stack sp)
sp (1+ sp))
(go branch)
branch
(if (null tree)
(go push0)
(go push-lst))
push0
(setf (svref stack sp) 0
sp (1+ sp))
(go rtn)
push-lst
(setf (svref stack sp) (cadr tree)
sp (1+ sp)
ret-label 'push-rst)
(go call)
push-rst
(setf (svref stack sp) (cddr tree)
sp (1+ sp)
ret-label 'add1)
(go call)
add1
(setf sp (1- sp)
m (svref stack sp)
sp (1- sp)
n (svref stack sp)
(svref stack sp) (+ 1 m n)
sp (1+ sp))
(go rtn)
rtn
(setf sp (1- sp)
rv (svref stack sp)
dp (1- dp)
sp (svref dump dp)
(svref stack sp) rv
sp (1+ sp)
dp (1- dp)
tree (svref dump dp)
dp (1- dp)
jmp (svref dump dp))
(cond ((eq jmp 'halt)
(go halt))
((eq jmp 'push-rst)
(go push-rst))
((eq jmp 'add1)
(go add1)))
halt)
(svref stack (1- sp))))
CL-USER> (let ((tree (make-zero-tree 28)))
(time (num-node6 tree)))
Evaluation took:
21.493 seconds of real time
21.473343 seconds of user run time
0.0 seconds of system run time
0 page faults and
33,296 bytes consed.
268435455
CL-USER>
Cut the time by about a third. I suppose if this were scheme I could get
the tagbody to jump optimization by using named thunks?
E.g.
(letrec (;... register & stack init
(rtn (lambda ()
(set! sp (- sp 1))
(set! rv (vector-ref stack sp))
(set! dp (- dp 1))
(set! sp (vector-ref dump dp))
(vector-set! stack sp rv)
(set! sp (+ sp 1))
(set! dp (- dp 1))
(set! tree (vector-ref dump dp))
(set! dp (- dp 1))
(set! jmp (vector-ref dump dp))
(cond ((eq jmp 'halt)
(halt))
((eq jmp 'push-rst)
(push-rst))
((eq jmp 'add1)
(add1)))))
...more states
Matt
--
"You do not really understand something unless you can
explain it to your grandmother." — Albert Einstein.
On Thu, 16 Mar 2006 16:01:16 -0600, Matthew D Swank wrote:
> (cond ((eq jmp 'halt)
> (halt))
> ((eq jmp 'push-rst)
> (push-rst))
> ((eq jmp 'add1)
> (add1)))
s:/eq/eq?
--
"You do not really understand something unless you can
explain it to your grandmother." — Albert Einstein.
Matthew D Swank wrote:
> On Thu, 16 Mar 2006 09:32:06 -0800, Joe Marshall wrote:
>
> > I think most of the time here is wasted on the ECASE dispatch. This is
> > going to turn into a big COND expression that checks the state against
> > each of the tags in turn.
> >
> > You can inline the majority of your state transitions, so I'd start
> > with that. You can use TAGBODY and GO to control the flow. They
> > should turn into jump instructions.
>
> Here's the new version:
>
> (defun num-node-smachine (tree)
[code snipped]
Try these declarations:
(declare (type (simple-array t (1000)) stack dump)
(type (integer 0 (1000)) sp dp)
(optimize (speed 3) (safety 0)))
You can do it this way, but why? I tried it out and the compiled code
looks like this:
... (lots of code omitted, this is a representative sample)
L7: 345: 8B7DFC move edi, [ebp-4]
348: C1FF06 sar edi, 6
351: 8B75F4 move esi, [ebp-C]
354: C7443E1000000000 move [esi+10+edi], 0
362: 8B7DFC move edi, [ebp-4]
365: 81C700010000 add edi, 100
371: 897DFC move [ebp-4], edi
L8: 374: 8B7DFC move edi, [ebp-4]
377: 81EF00010000 sub edi, 100
383: 897DFC move [ebp-4], edi
386: 8B7DFC move edi, [ebp-4]
389: C1FF06 sar edi, 6
392: 8B75F4 move esi, [ebp-C]
395: 8B7C3E10 move edi, [esi+10+edi]
399: 8B5DF8 move ebx, [ebp-8]
402: 81EB00010000 sub ebx, 100
408: 895DF8 move [ebp-8], ebx
411: 8B5DF8 move ebx, [ebp-8]
414: C1FB06 sar ebx, 6
417: 8B75EC move esi, [ebp-14]
420: 8B5C1E10 move ebx, [esi+10+ebx]
424: 895DFC move [ebp-4], ebx
427: 8B5DFC move ebx, [ebp-4]
430: 897DF0 move [ebp-10], edi
433: 89DF move edi, ebx
435: C1FF06 sar edi, 6
438: 8B75F0 move esi, [ebp-10]
441: 8B4DF4 move ecx, [ebp-C]
444: 89743910 move [ecx+10+edi], esi
448: F645F007 testb [ebp-10], 7
452: 740B je L9
454: FF75F4 push [ebp-C]
457: 8B45F0 move eax, [ebp-10]
....
As you can see, the code is spending all its time moving local
variables back and forth between the stack and the registers. This is
no doubt because of the large number of locals and the emulated stack.
But this is not going to be anywhere near as efficient as just using
the recursive code.
On Thu, 16 Mar 2006 17:11:04 -0800, Joe Marshall wrote:
> As you can see, the code is spending all its time moving local
> variables back and forth between the stack and the registers. This is
> no doubt because of the large number of locals and the emulated stack.
> But this is not going to be anywhere near as efficient as just using
> the recursive code.
Well, it's just a finger exercise in turning recursion into iteration.
I wouldn't want to maintain code written this way. In fact I didn't so
much write the code as transform it. At one point it was a more straight
forward secd like machine with the states as instructions in a code tree:
code <- (branch (push0 rtn) (push-lst call push-rst call add1 rtn))
Matt
--
"You do not really understand something unless you can
explain it to your grandmother." — Albert Einstein.