From: Mike Dixon
Subject: evaluating CLOS performance
Date: 
Message-ID: <MDIXON.89Oct31193234@thelonius.PARC.xerox.com>
There have been a number of messages in different forums lately about
CLOS implementation.  David Moon's message to the X3J13 mailing list
addresses some issues having to do with the implementation of method
combination.  This message addresses some issues having to do with
proper evaluation of the performance of CLOS in general, and PCL in
particular.

While I have a number of things to say about this, there are three major
points I want to stress:

  1) Performance comparisons are difficult.  It is important to
     compare like functionalities, and be sure you understand the
     "semantics" of the operation you are measuring.

  2) PCL is not CLOS.  PCL is a portable implementation with some
     interesting architectural ideas, I believe it can be made to
     perform relatively well, but most of the current ports do not
     perform adequately.

  3) Not all ports of PCL are created equal.  The P in PCL means
     that PCL can be ported relatively easily, not that it is written
     in pure Common Lisp.

     There are many things which PCL would like to do, which Common
     Lisp compilers are reluctant to let me do.  In any given Common
     Lisp, it takes work to get the compiler "out of PCL's way".
     In some Common Lisps this process is farther along that in others,
     but in no Common Lisp is it far enough along.

The rest of this message discusses a number of aspects of the CLOS
language and PCL implementation, providing various insights and evidence
to support these three points.

An important point about CLOS which this message will develop is that
because CLOS is a tight integration with Common Lisp, different but
similar features work together well.  defstruct and defclass define
different kinds of classes, but methods can be defined on both kinds --
this makes it possible to choose the kind of class based on one set of
requirements and use generic functions regardless of the decision.
defun and defmethod define different kinds of functions, but both are
called in the same way -- this makes it possible to choose the kind of
function based on one set of requirements but have the callers be able
to use it the same way regardless of the decision.

In short, this makes it possible for a programmer using CLOS-extended
Common Lisp to choose just exactly the functionality they want in any
given case.  The language provides a range of functionalities which all
fit together.

<<An expanded version of this note would give some guidelines for making
these decisions, or at least some examples.  This version doesn't do
that unfortunately.>>

So, when making performance comparisons, it is important to compare
similar functionalities -- this corresponds to the common folk wisdom
that you shouldn't compare apples and oranges.  A performance comparison
should always include a functionality comparison.  Two common problem
areas are field access and procedure invocation.

*** Field Access ***

Let's start with the example of field access.  Many people try to make
comparisons among the performance of defstruct field access, slot access
and calling accessor generic functions.  Following the principle
outlined above it is important to understand the performance and
functionality of the many different field access mechanisms provided:

Given the following definitions:

(defstruct foo (a 1))
(defclass bar () ((a :initform 1 :accessor bar-a)))

(defvar *foo* (make-foo))
(defvar *bar* (make-instance 'bar))

Let us look at the following forms:

(defun fun ()					
  (foo-a *foo*)                ;Form 1
  (slot-value *bar* 'a)        ;Form 2
  (bar-a *bar*))               ;Form 3

(defmethod meth ((b bar))
  (slot-value *bar* 'a)        ;Form 4
  (bar-a *bar*))               ;Form 5

The rest of this section looks at comparisons among forms 1 - 5.  Each
of these forms has different functionality AND different performance.

Form 1: 
  This is a defstruct field access.  In many implementations, this
  compiles open without any type checking, so the time required is
  about one memory read.  

  The important points about the functionality of this operation are:
  it (usually) does no error checking; it only supports single
  inheritance, and because it compiles open you must recompile
  all your code if you redefine it.

Form 2:
  This is a call to slot-value, where the object argument is a an
  instance of a standard class (a class defined by defclass without
  using the :metaclass option).  Moreover, this call to slot-value
  appears in a context in which the compiler cannot infer anything
  about the class of the object argument.  In the current version of
  PCL, this takes a long time, I don't know how long.  In the next
  version, this should take an amount of time comparable to 3 or 4
  function calls, but see my comments later about porting PCL.  I
  haven't though very much about what the fastest one could expect 
  this to be is.

  The important points about the functionality of this operation are:
  it does complete error checking (is its argument an object, does it
  have a slot by this name, is that slot bound); the class involved
  supports multiple inheritance; it doesn't compile open, so you can
  change the class definition later; you can define methods on
  slot-value-using-class which would affect this operation -- so the
  slot access is specializable.

Form 3:
  This is a call to an automatically generated accessor of a
  standard class.  In the next version of PCL, this call should take
  time between that of one and two function calls, but see my comments
  later about porting PCL.  

  The important points about the functionality of this operation are:
  this is a full call to a real generic function -- you can define
  other methods on this generic function and thereby specialize it;
  error checking happens.

Form 4:
  This is like form 2 with the exception that the form appears in
  a context where the compiler can infer a lot about the object
  argument.  In the next version of PCL, this should take time between
  2 and 3 memory reads, but see my comments later about porting PCL.
  Most implementations should provide performance for this on the order
  of two memory reads.

Form 5:
  This is like form 3 with the exception that the form appears in
  a context where the compiler can infer a lot about the object
  argument.  PCL will not do any better in this situation, but other
  implementations should be able to get performance similar to that
  for form 4.


The point is that these operations have performance and functionality
differences.  The appropriate mechanism must be selected, by the
programmer, to get the functionality/performance tradeoff needed in each
situation.

Because CLOS-extended Common Lisp makes it possible to define methods on
classes defined by defstruct as well as classes defined by defstruct,
decisions about how to get field access don't affect whether it will be
possible to use generic functions.  This is an important point about the
tight integration of CLOS in Common Lisp.


*** Procedure Invocation and Type Dispatch ***

The second common comparison is between generic function invocation and
ordinary function invocation.  This comparison is also fraught with
peril.

Given the following definitions:

(defstruct foo)
(defstruct bar)
(defclass baz () ())
(defclass bee () ())

(defun f1 (x)           (defun f2 (x)           (defun f3 (x)
  (etypecase x            (etypecase x            (etypecase x
    (foo <place 1>)         (baz <place 3>)         (foo <place 5>)
    (bar <place 2>)))       (bee <place 4>)))       (baz <place 6>)))


(defmethod gf1 ((x foo)) <place 11>)
(defmethod gf1 ((x bar)) <place 12>)

(defmethod gf2 ((x baz)) <place 13>)
(defmethod gf2 ((x bee)) <place 14>)

(defmethod gf3 ((x foo)) <place 15>)
(defmethod gf3 ((x baz)) <place 16>)


The most important point here is to notice that we must compare the
performance of a generic function call to the performance of an ordinary
function call PLUS AN ETYPECASE.  It isn't appropriate to compare the
performance of an ordinary function call to a generic function call
because a generic function call does much more than a function call.
If, in a place in your code all you need is an ordinary function, then
that is what you should use, if what you need is a function call plus
some form of type dispatch you usually want to use generic functions.

Basically, when invoking a form like (f1 <object>) or (gf2 <object>),
you should expect the time it takes to get to the corresponding <place
n> mark above to be the same.  In many implementations, people will do
a better job of optimizing generic function call than typecase, so you
may even get better performance out of generic function call than
typecase.  (See comments on PCL later).

Returning for a moment to functionality, generic functions provide a
great deal of functionality that ordinary functions plus typecase
doesn't.  The most obvious is that the definitions of the individual
cases (the <place n> forms) above can be spread around -- individual
defmethod forms can appear in separate places.  Also, defmethod forms
can be added later without having access to the source of the previously
existing defmethod forms.  Also, generic functions implement inheritance
automatically.  Also, generic functions implement method combination
automatically.


Taken together, the discussion of field access and the implementation
of type specific operations show that some of the decisions a user needs
to make are:

Field Access:

  Do I want multiple or single inheritance?
  Do I want to be able to define methods on slot accessors?
  Do I want specializable primitive slot access?

Procedure invocation and type dispatch?

  Do I want genericity or just simple procedure call?
  Do I want automatic implementation of inheritance?
  Do I want automatic implementation of method combination?


It is the answers to these kinds of questions which dictate which
language feature you should use in each particular case.  Because CLOS
is a tight integration with Common Lisp, the different language features
work seamlessly together, that makes it possible for you, the programmer
to choose just the semantics you want in each case.


*** Some comments on PCL ***

This section provides one small example of the difference between PCL's
architectural performance possibilities and what happens in the current
ports.  By "architectural performance possibilities", I mean what PCL
should be able to do if I had my hands on the sources of the underlying
lisp (and I had a lot more time than I do).  By "what happens in the
current ports" I mean the performance of the current ports of PCL and
the port-specific reasons why that is different from the architectural
possibilities.

This discussion is limited to PCL on stock hardware, PCL was never
intended to be at all competitive on Lisp machines.

In particular we will look at a small aspect of one of PCL's major
tasks, implementing method lookup (forsimplicity, this discussion
ignores method combination.)

The essence of method lookup is that when a generic function is called,
the classes of the arguments is used to select an appropriate method to
be called and then that method is called.

This means that in the heart of method lookup, what PCL is doing is
checking the class of arguments, finding a method function to call, and
then calling that method function.  All of this happens inside the PCL
runtime, where the datastructures floating around have already been
checked for errors.  This means that many of the operations involved in
method lookup require no error checking.  For example, PCL can be sure
that the method function it will call is a legal function, and moreover
(using a simple trick) can make sure that it is a compiled function.
So, once the method function is looked up, the actual call to it
requires no error checking.

Looking in the PCL sources (this only appears in the new, unreleased
version) we find the following definition:

(defmacro |RUNTIME FUNCALL| (fn &rest args) 
  `(funcall (the compiled-function ,fn) ,.args))

The use of the declaration "(the compiled-function ,fn)" corresponds to
the fact that PCL is sure that this macro (|RUNTIME FUNCALL|) will never
be used unless the fn argument is in fact a compiled function.

Architecturally, PCL has enabled an optimization which should improve
the performance of method lookup.

Unfortunately, Common Lisp compilers don't respect this declaration.
The compilers emit code which first checks to ensure that the first
argument to funcall is a function before calling it.  This one thing is
just a small bit of overhead, but it directly slows down generic
function call.  Not tremendously all by itself, but there are,
unfortunately, a number of others like it.

The root of the problem is that Common Lisp doesn't provide a way for
PCL to tell the underlying Common Lisp to eliminate the error checking,
and no existing port of PCL is good enough to eliminate all the
architecture--port differences.  (I should also say that previous
versions of PCL have not had anywhere near as good an architecture).

It is reasonable to expect that better ports of newer versions of PCL
will have better performance.  It will be interesting to compare the
performance of these to implementation specific, "hand-coded"
implementations of CLOS.  I don't really know what the performance
differences will be.


*** Appendix ***

The following examples show how two, major Common Lisps for the Sun4
compile an ordinary function call and a use of funcall as shown above.

These two Common Lisps were chosen for illustrative purposes only.  No
comparison between them or with any other is intended.  No endorsement
of either Common Lisp is intended.  No criticism of either Common Lisp
is intended.  All the Common Lisps I have used generate similar code,
that is the whole point.  These two were just chosen to illustrate the
general point.

Also note that even those these two code vectors look a little
different, they are substantially the same.

These two functions can be used to illustrate the difference between the
best out of line function call a given Lisp knows how to do and the kind
of function call you get when you have to use funcall.

(proclaim '(optimize (speed 3) (safety 0) (compilation-speed 0)))

(defun foo (x y)
  (funcall (the compiled-function x) y))
(defun bar (x y) (baz y))


;One Common Lisp

;; disassembling #<Function FOO @ #x873716>
;; formals: X Y
;; constant vector:

;; code vector @ #x87344c:
   0:   save    #x-60,%o6
   4:   move.l  %i0,%l0
   8:   and     #x7,%l0,%g2
  12:   subcc   #x2,%g2,%g0
  16:   beq,a   84
  20:   move.l  %l0,%g2
  24:   and     #x7,%l0,%g2
  28:   subcc   #x6,%g2,%g0
  32:   beq,a   40
  36:   moveu.b -6(%l0),%g2
  40:   subcc   #x8,%g2,%g0
  44:   beq,a   116
  48:   move.l  %l0,%o5
  52:   move.l  %l0,%o0
  56:   move.l  219(%g4),%o7    ; FUNCALL
  60:   move.l  %i1,%o1
  64:   jmpl    0(%o7),%o7
  68:   move.l  #x2,%g3
  72:   jmpl    8(%i7),%g0
  76:   restore %g0,%o0
  80:   move.l  %l0,%g2
  84:   move.l  6(%g2),%o5
  88:   move.l  %i1,%o0
  92:   move.l  -2(%o5),%g6
  96:   jmpl    0(%g6),%o7
 100:   xor     #x1,%g0,%g3
 104:   bra     72
 108:   nop
 112:   move.l  %l0,%o5
 116:   move.l  %i1,%o0
 120:   move.l  -2(%o5),%g6
 124:   move.l  %g4,%g2
 128:   jmpl    0(%g6),%o7
 132:   xor     #x1,%g0,%g3
 136:   bra     72
 140:   nop

;; disassembling #<Function BAR @ #x87684e>
;; formals: X Y
;; constant vector:
0:      BAZ

;; code vector @ #x876784:
   0:   save    #x-60,%o6
   4:   move.l  34(%i5),%g2     ; BAZ
   8:   move.l  %i1,%o0
  12:   move.l  6(%g2),%o5
  16:   move.l  -2(%o5),%g6
  20:   jmpl    0(%g6),%o7
  24:   xor     #x1,%g0,%g3
  28:   jmpl    8(%i7),%g0
  32:   restore %g0,%o0


;Another Common Lisp

(disassemble 'foo)

        move        %in0, %loc0
        move        %in1, %in0
        move        %loc0, %ncp
        move        [%sq + 911], %nra   ; SQ-COERCE-TO-PROCEDURE
        jmpl        %nra, %nra - 2
        nop
        move        [%ncp - 2], %nra    ; Function Code
        move        4, %u0
        jmpl        %0, %nra - 2
        move        %ncp, %cp

(disassemble 'bar)

        move        %in1, %in0
        move        [%cp + 30], %x0     ; BAZ
        move        [%x0 + 13], %cp
        move        [%cp - 2], %nra     ; Function Code
        jmpl        %0, %nra - 2
        move        4, %u0