Recursion CPS, and doing it yourself

From: Matthew D Swank
Subject: Recursion CPS, and doing it yourself
Date: Thu, 16 Mar 2006 00:41:44 +0000
Message-ID: <pan.2006.03.16.00.41.43.869547@c.net>

I another thread: http://makeashorterlink.com/?N69B61DCC I noted that
doing tree recursion in CPS can be expensive.

This was my example:
CL-USER> (defun make-zero-tree (depth)
           (if (zerop depth)
               nil
               (let ((tree (make-zero-tree (- depth 1)))) 
                 (cons 0 (cons tree tree)))))
MAKE-ZERO-TREE
CL-USER> (defun num-node (tree)
           (if (null tree)
               0
               (+ 1 (num-node (cadr tree)) (num-node (cddr tree)))))
NUM-NODE
CL-USER> (defun num-node-cps (tree &optional (k #'identity))
           (if (null tree)
               (funcall k 0)
               (num-node-cps  (cadr tree)
                              #'(lambda (n)
                                  (num-node-cps  (cddr tree)
                                                 #'(lambda (m)
                                                     (funcall k (+ 1 m n))))))))
NUM-NODE-CPS
CL-USER> (let ((tree (make-zero-tree 28)))
           (time (num-node tree)))
Evaluation took:
  12.308 seconds of real time
  12.304769 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  8,192 bytes consed.
268435455
CL-USER> (let ((tree (make-zero-tree 28)))
           (time (num-node-cps tree)))
Evaluation took:
  82.417 seconds of real time
  77.42084 seconds of user run time
  0.352022 seconds of system run time
  0 page faults and
  12,884,939,992 bytes consed.
268435455
CL-USER> 

I'm waiting to hear if there is a way to make the CPS code less
expensive.  In the meantime, I tried to see how much of a speedup I could
get by turning the function into special purpose stack machine.  This is
what I came up with:


(defun num-node-smachine (tree)
  (let* ((stack (make-array '(1000)))
         (sp (progn (setf (svref stack 0) tree) 1))
         (tree nil)
         (dump (make-array '(1000)))
         (dp 0)
         (state 'call)
         (m 0)
         (n 0)
         (rv 0)
         (ret-inst 'halt))
    (declare (type fixnum m n rv sp dp)
             (optimize (speed 3)))
    (loop (ecase state
            (call (setf (svref dump dp) ret-inst    
                        dp (1+ dp)                        
                        (svref dump dp) tree 
                        dp (1+ dp)
                        sp (1- sp)
                        (svref dump dp) sp
                        dp (1+ dp)
                        tree (svref stack sp)
                        sp (1+ sp)
                        state 'branch))
            
            (halt (return (aref stack (1- sp))))
            
            (branch (if (null tree)
                        (setf state 'push0)
                      (setf state 'push-lst)))

            (push0 (setf (svref stack sp) 0 
                         sp (1+ sp)
                         state 'rtn))
            
            (push-lst (setf (svref stack sp) (cadr tree)
                            sp (1+ sp)
                            ret-inst 'push-rst
                            state 'call))
            
            (push-rst (setf (svref stack sp) (cddr tree) 
                            sp (1+ sp)
                            ret-inst 'add1
                            state 'call))
            
            (add1 (setf sp (1- sp)
                        m (svref stack sp)
                        sp (1- sp)
                        n (svref stack sp)
                        (svref stack sp) (+ 1 m n)
                        sp (1+ sp)
                        state 'rtn))
            
            (rtn (setf sp (1- sp)
                       rv (svref stack sp)
                       dp (1- dp)
                       sp (svref dump dp)
                       (svref stack sp) rv
                       sp (1+ sp)
                       dp (1- dp)
                       tree (svref dump  dp)
                       dp (1- dp)
                       state (svref dump dp)))))))

Yikes!

CL-USER> (let ((tree (make-zero-tree 28)))
           (time (num-node-smachine tree)))
Evaluation took:
  36.516 seconds of real time
  36.50628 seconds of user run time
  0.004 seconds of system run time
  0 page faults and
  40,768 bytes consed.
268435455
CL-USER> 

An improvement on the CPS code, but still a lot worse than the
original function.  This is just a toy, but how do you get good
performance in lisp using low level code like num-node-smachine?

Matt

-- 
"You do not really understand something unless you can
 explain it to your grandmother." — Albert Einstein.

Re: Recursion CPS, and doing it yourself Joe Marshall
- Re: Recursion CPS, and doing it yourself Matthew D Swank
  - Re: Recursion CPS, and doing it yourself Matthew D Swank
  - Re: Recursion CPS, and doing it yourself Joe Marshall
    - Re: Recursion CPS, and doing it yourself Matthew D Swank

From: Joe Marshall
Subject: Re: Recursion CPS, and doing it yourself
Date: Thu, 16 Mar 2006 17:32:06 +0000
Message-ID: <1142530326.451240.28670@e56g2000cwe.googlegroups.com>

Matthew D Swank wrote:
> I another thread: http://makeashorterlink.com/?N69B61DCC I noted that
> doing tree recursion in CPS can be expensive.
>
> CL-USER> (let ((tree (make-zero-tree 28)))
>            (time (num-node tree)))
> Evaluation took:
>   12.308 seconds of real time
>   12.304769 seconds of user run time
>   0.0 seconds of system run time
>   0 page faults and
>   8,192 bytes consed.
> 268435455
> CL-USER> (let ((tree (make-zero-tree 28)))
>            (time (num-node-cps tree)))
> Evaluation took:
>   82.417 seconds of real time
>   77.42084 seconds of user run time
>   0.352022 seconds of system run time
>   0 page faults and
>   12,884,939,992 bytes consed.
> 268435455

That's `only' a factor of 6 slower.  Given that you increased consing
by a factor of a million, that's not bad at all.

Using the stack is going to be more efficient than using the heap.  See

@techreport{ miller94garbage,
    author = "James S. Miller and Guillermo J. Rozas",
    title = "Garbage Collection is Fast, but a Stack is Faster",
    number = "AIM-1462",
    pages = "37",
    year = "1994",
    url = "citeseer.ist.psu.edu/miller94garbage.html" }

>
> I'm waiting to hear if there is a way to make the CPS code less
> expensive.

I can't think of anything obvious.  You may want to check that your GC
is tuned for this large amount of consing.  You should make your
youngest generation maybe 3/4 the size of physical memory if your
implementation allows it.

>  In the meantime, I tried to see how much of a speedup I could
> get by turning the function into special purpose stack machine.  This is
> what I came up with:
[code snipped]

I think most of the time here is wasted on the ECASE dispatch.  This is
going to turn into a big COND expression that checks the state against
each of the tags in turn.

You can inline the majority of your state transitions, so I'd start
with that.  You can use TAGBODY and GO to control the flow.  They
should turn into jump instructions.

From: Matthew D Swank
Subject: Re: Recursion CPS, and doing it yourself
Date: Thu, 16 Mar 2006 22:01:16 +0000
Message-ID: <pan.2006.03.16.22.01.16.585614@c.net>

On Thu, 16 Mar 2006 09:32:06 -0800, Joe Marshall wrote:

> I think most of the time here is wasted on the ECASE dispatch.  This is
> going to turn into a big COND expression that checks the state against
> each of the tags in turn.
> 
> You can inline the majority of your state transitions, so I'd start
> with that.  You can use TAGBODY and GO to control the flow.  They
> should turn into jump instructions.

Here's the new version:

(defun num-node-smachine (tree)
  (let* ((stack (make-array '(1000)))
         (sp (progn (setf (svref stack 0) tree) 1))
         (tree nil)
         (dump (make-array '(1000)))
         (dp 0)
         (m 0)
         (n 0)
         (rv 0)
         (ret-label 'halt)
         (jmp nil))
    (declare (type fixnum m n rv sp dp)
             (optimize (speed 3)))
    (tagbody 
     call 
       (setf (svref dump dp) ret-label    
             dp (1+ dp)                        
             (svref dump dp) tree 
             dp (1+ dp)
             sp (1- sp)
             (svref dump dp) sp
             dp (1+ dp)
             tree (svref stack sp)
             sp (1+ sp))
       (go branch)
                
     branch 
       (if (null tree)
           (go push0)
           (go push-lst))

     push0 
       (setf (svref stack sp) 0 
             sp (1+ sp))
       (go rtn)
            
     push-lst 
       (setf (svref stack sp) (cadr tree)
             sp (1+ sp)
             ret-label 'push-rst)
       (go call)
            
     push-rst 
       (setf (svref stack sp) (cddr tree) 
             sp (1+ sp)
             ret-label 'add1)
       (go call)
            
     add1 
       (setf sp (1- sp)
             m (svref stack sp)
             sp (1- sp)
             n (svref stack sp)
             (svref stack sp) (+ 1 m n)
             sp (1+ sp))
       (go rtn)
            
     rtn 
       (setf sp (1- sp)
             rv (svref stack sp)
             dp (1- dp)
             sp (svref dump dp)
             (svref stack sp) rv
             sp (1+ sp)
             dp (1- dp)
             tree (svref dump  dp)
             dp (1- dp)
             jmp (svref dump dp))
       (cond ((eq jmp 'halt)
              (go halt))
             ((eq jmp 'push-rst)
              (go push-rst))
             ((eq jmp 'add1)
              (go add1)))

     halt) 
    (svref stack (1- sp))))

CL-USER> (let ((tree (make-zero-tree 28)))
           (time (num-node6 tree)))
Evaluation took:
  21.493 seconds of real time
  21.473343 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  33,296 bytes consed.
268435455
CL-USER> 

Cut the time by about a third. I suppose if this were scheme I could get
the tagbody to jump optimization by using named thunks?
E.g.
(letrec (;... register & stack init
         (rtn (lambda ()
                (set! sp (- sp 1))
                (set! rv (vector-ref stack sp))
                (set! dp (- dp 1))
                (set! sp (vector-ref dump dp))
                (vector-set! stack sp rv)
                (set! sp (+ sp 1))
                (set! dp (- dp 1))
                (set! tree (vector-ref dump  dp))
                (set! dp (- dp 1))
                (set! jmp (vector-ref dump dp))
                (cond ((eq jmp 'halt)
                       (halt))
                      ((eq jmp 'push-rst)
                       (push-rst))
                      ((eq jmp 'add1)
                       (add1)))))
            ...more states

Matt
-- 
"You do not really understand something unless you can
 explain it to your grandmother." — Albert Einstein.

From: Matthew D Swank
Subject: Re: Recursion CPS, and doing it yourself
Date: Thu, 16 Mar 2006 22:06:07 +0000
Message-ID: <pan.2006.03.16.22.06.03.240788@c.net>

On Thu, 16 Mar 2006 16:01:16 -0600, Matthew D Swank wrote:

> (cond ((eq jmp 'halt)
>                        (halt))
>                       ((eq jmp 'push-rst)
>                        (push-rst))
>                       ((eq jmp 'add1)
>                        (add1)))

s:/eq/eq?

-- 
"You do not really understand something unless you can
 explain it to your grandmother." — Albert Einstein.

From: Joe Marshall
Subject: Re: Recursion CPS, and doing it yourself
Date: Fri, 17 Mar 2006 01:11:04 +0000
Message-ID: <1142557864.865891.134760@u72g2000cwu.googlegroups.com>

Matthew D Swank wrote:
> On Thu, 16 Mar 2006 09:32:06 -0800, Joe Marshall wrote:
>
> > I think most of the time here is wasted on the ECASE dispatch.  This is
> > going to turn into a big COND expression that checks the state against
> > each of the tags in turn.
> >
> > You can inline the majority of your state transitions, so I'd start
> > with that.  You can use TAGBODY and GO to control the flow.  They
> > should turn into jump instructions.
>
> Here's the new version:
>
> (defun num-node-smachine (tree)
[code snipped]

Try these declarations:
    (declare (type (simple-array t (1000)) stack dump)
             (type (integer 0 (1000)) sp dp)
             (optimize (speed 3) (safety 0)))

You can do it this way, but why?  I tried it out and the compiled code
looks like this:

... (lots of code omitted, this is a representative sample)
L7:  345:      8B7DFC           move  edi, [ebp-4]
     348:      C1FF06           sar   edi, 6
     351:      8B75F4           move  esi, [ebp-C]
     354:      C7443E1000000000 move  [esi+10+edi], 0
     362:      8B7DFC           move  edi, [ebp-4]
     365:      81C700010000     add   edi, 100
     371:      897DFC           move  [ebp-4], edi
L8:  374:      8B7DFC           move  edi, [ebp-4]
     377:      81EF00010000     sub   edi, 100
     383:      897DFC           move  [ebp-4], edi
     386:      8B7DFC           move  edi, [ebp-4]
     389:      C1FF06           sar   edi, 6
     392:      8B75F4           move  esi, [ebp-C]
     395:      8B7C3E10         move  edi, [esi+10+edi]
     399:      8B5DF8           move  ebx, [ebp-8]
     402:      81EB00010000     sub   ebx, 100
     408:      895DF8           move  [ebp-8], ebx
     411:      8B5DF8           move  ebx, [ebp-8]
     414:      C1FB06           sar   ebx, 6
     417:      8B75EC           move  esi, [ebp-14]
     420:      8B5C1E10         move  ebx, [esi+10+ebx]
     424:      895DFC           move  [ebp-4], ebx
     427:      8B5DFC           move  ebx, [ebp-4]
     430:      897DF0           move  [ebp-10], edi
     433:      89DF             move  edi, ebx
     435:      C1FF06           sar   edi, 6
     438:      8B75F0           move  esi, [ebp-10]
     441:      8B4DF4           move  ecx, [ebp-C]
     444:      89743910         move  [ecx+10+edi], esi
     448:      F645F007         testb [ebp-10], 7
     452:      740B             je    L9
     454:      FF75F4           push  [ebp-C]
     457:      8B45F0           move  eax, [ebp-10]
....

As you can see, the code is spending all its time moving local
variables back and forth between the stack and the registers.  This is
no doubt because of the large number of locals and the emulated stack.
But this is not going to be anywhere near as efficient as just using
the recursive code.

From: Matthew D Swank
Subject: Re: Recursion CPS, and doing it yourself
Date: Fri, 17 Mar 2006 01:39:22 +0000
Message-ID: <pan.2006.03.17.01.39.20.856107@c.net>

On Thu, 16 Mar 2006 17:11:04 -0800, Joe Marshall wrote:

> As you can see, the code is spending all its time moving local
> variables back and forth between the stack and the registers.  This is
> no doubt because of the large number of locals and the emulated stack.
> But this is not going to be anywhere near as efficient as just using
> the recursive code.

Well, it's just a finger exercise in turning recursion into iteration. 
I wouldn't want to maintain code written this way.  In fact I didn't so
much write the code as transform it.  At one point it was a more straight
forward secd like machine with the states as instructions in a code tree:

code <- (branch (push0 rtn) (push-lst call push-rst call add1 rtn))

Matt

-- 
"You do not really understand something unless you can
 explain it to your grandmother." — Albert Einstein.