with-slots slow

From: Tamas K Papp
Subject: with-slots slow
Date: Mon, 27 Oct 2008 13:10:57 +0000
Message-ID: <6mlsv1FhetnnU1@mid.individual.net>

I have found classes very useful for organizing complex computations, and 
the with-slots macro is quite convenient for accessing slots.  Recently I 
started profiling, and realized that it is also slowing things down: 
sometimes I only need to read slots, but there is still a function call 
in there.

Here is some example code:

(defclass myclass ()
  ((s :initform 12d0)))

(defun benchmark (n)
  "Without slot access, so we can subtract this as the fixed time."
  (declare (fixnum n))
  (let ((sum 0d0))
    (declare (double-float sum))
    (dotimes (i n)
      (incf sum (+ 1.8d0 (random 5d0))))
    sum))

(defun foo1 (instance n)
  "Using with-slots."
  (declare (fixnum n))
  (with-slots (s) instance
    (let ((sum 0d0))
      (declare (double-float sum s))
      (dotimes (i n)
        (incf sum (+ s (random 5d0))))
      sum)))

(defun foo2 (instance n)
  "Using let."
  (declare (fixnum n))
  (let ((sum 0d0)
        (s (slot-value instance 's)))
    (declare (double-float sum s))
    (dotimes (i n)
      (incf sum (+ s (random 5d0))))
    sum))

Using SBCL,

(defparameter *n* 50000000)
(time (benchmark *n*))			; 2.57s
(time (foo1 (make-instance 'myclass) *n*)) ; 3.78s
(time (foo2 (make-instance 'myclass) *n*)) ; 2.42s, clearly some
                                           ; measurement error in there
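
Single runs like these are noisy (the FOO2 figure even beats the baseline), so repeated timings help. A minimal sketch in portable Common Lisp, assuming wall-clock time from GET-INTERNAL-REAL-TIME is good enough; BEST-OF-N-SECONDS is a hypothetical helper, not something from the original post:

```lisp
(defun best-of-n-seconds (thunk &key (runs 5))
  "Call THUNK RUNS times; return the shortest wall-clock time in seconds,
or NIL when RUNS is zero. Taking the minimum filters out some scheduling
and GC noise."
  (let ((best nil))
    (dotimes (i runs best)
      (let ((start (get-internal-real-time)))
        (funcall thunk)
        (let ((elapsed (/ (float (- (get-internal-real-time) start) 1d0)
                          internal-time-units-per-second)))
          (when (or (null best) (< elapsed best))
            (setf best elapsed)))))))

;; Hypothetical usage, with FOO2, MYCLASS and *N* as defined above:
;; (best-of-n-seconds (lambda () (foo2 (make-instance 'myclass) *n*)) :runs 3)
```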

So I came up with a macro, which is really a trivial modification of 
with-slots for cases when I only need to read slots:

(defmacro with-slots* (slots instance &body body)
  "Macro similar to with-slots, but with the slots made read-only."
  (let ((in (gensym)))
    `(let ((,in ,instance))
       (declare (ignorable ,in))
       (let ,(mapcar (lambda (slot-entry)
                       (let ((var-name
                              (if (symbolp slot-entry)
                                  slot-entry
                                  (car slot-entry)))
                             (slot-name
                              (if (symbolp slot-entry)
                                  slot-entry
                                  (cadr slot-entry))))
                         `(,var-name
                           (slot-value ,in ',slot-name))))
                     slots)
         ,@body))))

Replacing with-slots with with-slots* in places where it makes sense led 
to a 40% improvement in the speed of my calculations.

Tamas


From: Raffael Cavallaro
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 13:55:04 +0000
Message-ID: <ge4h7o$qb4$1@aioe.org>

On 2008-10-27 09:10:57 -0400, Tamas K Papp <······@gmail.com> said:

> 
> Replacing with-slots with with-slots* in places where it makes sense lead
> to a 40% improvement in the speed of my calculations.

I only find about a 15% speedup using your with-slots* under LispWorks 
on an Intel 32-bit Mac. Maybe this is more of an SBCL-specific problem, 
better addressed on the sbcl-devel mailing list?

From: Rainer Joswig
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 12:44:04 +0000
Message-ID: <joswig-26F93A.14440427102008@news-europe.giganews.com>

In article <··············@mid.individual.net>,
 Tamas K Papp <······@gmail.com> wrote:

> I have found classes very useful for organizing complex computations, and 
> the with-slots macro is quite convenient for accessing slots.  Recently I 
> started profiling, and realized that it is also slowing things down: 
> sometimes I only need to read slots, but there is still a function call 
> in there.
> 

That's no surprise, I'd say. Accessing a local variable
is very likely to be faster than accessing slot values. WITH-SLOTS
is just a shorter notation - shorter than repeated
SLOT-VALUE forms. But there is no caching...
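
The CLHS describes WITH-SLOTS as equivalent to a LET that binds the instance plus a SYMBOL-MACROLET that turns each slot name into a SLOT-VALUE form, so every read (and every SETF) goes back to the object. A rough hand-expanded sketch, reusing MYCLASS from the first post; the function names are hypothetical:

```lisp
(defclass myclass ()
  ((s :initform 12d0)))

(defun read-twice/with-slots (instance)
  ;; Each reference to S inside WITH-SLOTS is a SLOT-VALUE call.
  (with-slots (s) instance
    (list s s)))

(defun read-twice/expanded (instance)
  ;; Roughly what WITH-SLOTS expands into, per the CLHS notes:
  ;; bind the instance once, then a symbol macro per slot.
  (let ((in instance))
    (symbol-macrolet ((s (slot-value in 's)))
      (list s s))))

(defun bump (instance)
  ;; Writes go through the object too: SETF of S becomes (SETF SLOT-VALUE).
  (with-slots (s) instance
    (setf s (+ s 1d0))))
```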

Slot access to CLOS objects is usually a bit slower than
slot access to structures. For structures the Lisp
system can/will assume that the position of the slot
does not change. With CLOS slots one usually can't
assume that. There are some extensions to Common Lisp,
where one can declare fixed slot positions for CLOS
classes.
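
For comparison, a minimal DEFSTRUCT version of the same loop, assuming a structure is acceptable for the data (single inheritance only, and redefining a structure has undefined consequences). MYSTRUCT and FOO-STRUCT are hypothetical names; with a typed slot, implementations can typically compile the reader to a load at a fixed offset:

```lisp
(defstruct mystruct
  ;; Typed slot: lets the compiler know the reader returns a DOUBLE-FLOAT.
  (s 12d0 :type double-float))

(defun foo-struct (instance n)
  "Same loop as FOO1, but reading the slot through a structure accessor."
  (declare (type mystruct instance) (fixnum n))
  (let ((sum 0d0))
    (declare (double-float sum))
    (dotimes (i n)
      (incf sum (+ (mystruct-s instance) (random 5d0))))
    sum))
```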

Access to variables should be faster than both CLOS slots
and structure slots.

-- 
http://lispm.dyndns.org/

From: Willem Broekema
Subject: Re: with-slots slow
Date: Tue, 28 Oct 2008 10:40:28 +0000
Message-ID: <c8f9feee-e004-4cd8-8e15-b4f2a71d81a9@e17g2000hsg.googlegroups.com>

On Oct 27, 1:44 pm, Rainer Joswig <······@lisp.de> wrote:
> SLOT access to CLOS object is usually a bit slower than
> slot access to structures. For structures the Lisp
> system can/will assume that the position of the slot
> does not change. With CLOS slots one usually can't
> assume that. There are some extensions to Common Lisp,
> where one can declare fixed slot positions for CLOS
> classes.

The MOP gives you standard-instance-access, not sure if that is the
extension you mean. Using it makes slot lookup wickedly fast:

(eval-when (:compile-toplevel :execute)
  (defun class-slot-ix (name class)
    (let* ((cls (find-class class))
           (slot (or (find name (closer-mop:class-slots cls)
                           :key #'closer-mop:slot-definition-name)
                     (error "Class ~A has no slot named ~A." cls name))))
      (closer-mop:slot-definition-location slot))))

(defun bar (instance n)
  (declare (fixnum n)
           (optimize (speed 3) (safety 0) (debug 0)))
  (symbol-macrolet ((s (closer-mop:standard-instance-access
                        instance #.(class-slot-ix 's 'myclass))))
    (let ((sum 0d0))
      (declare (double-float sum))
      (dotimes (i n)
        (incf sum (+ (the double-float s) (random 5d0))))
      sum)))

Results, all with same optimize declaration:

 (time (foo1 (make-instance 'myclass) *n*)) = 3.897   with-slots
 (time (bar  (make-instance 'myclass) *n*)) = 2.712   standard-instance-access
 (time (foo2 (make-instance 'myclass) *n*)) = 2.395   let


- Willem

From: Nikodemus Siivola
Subject: Re: with-slots slow
Date: Tue, 28 Oct 2008 11:19:04 +0000
Message-ID: <4906F528.60005@random-state.net>

Willem Broekema wrote:

> The MOP gives you standard-instance-access, not sure if that is the
> extension you mean. Using it makes slot lookup wickedly fast:

Be careful here! If the instance comes from a subclass of MYCLASS, the slot 
location might be different. (See COMPUTE-SLOTS for a way to make sure the 
slots are in the order you need them to be.)

Cheers,

  -- Nikodemus

From: Pascal Costanza
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 15:26:32 +0000
Message-ID: <6mm4t8Fg923gU1@mid.individual.net>

John Thingstad wrote:
> On Mon, 27 Oct 2008 14:10:57 +0100, Tamas K Papp <······@gmail.com> wrote:
> 
>> I have found classes very useful for organizing complex computations, and
>> the with-slots macro is quite convenient for accessing slots.  Recently I
>> started profiling, and realized that it is also slowing things down:
>> sometimes I only need to read slots, but there is still a function call
>> in there.
>>
> 
> Using with-slots is bad form anyhow as it breaks the abstraction created 
> by accessors. What if you changed the type to a computed type? 
> with-accessors is better in that respect, but there is no guarantee it 
> will be faster. (Depends on the implementation.)

Slot access via accessors is likely to be faster than via slot-value. 
Accessors should also be preferred due to potential 
:before/:after/:around methods on the accessors. Access via slot-value 
should be reserved for low-level access, for example as part of internal 
implementation details for a specific class.

(Unless you really know what you're doing, of course, as always. ;)
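
A small illustration of the point about auxiliary methods: an :AROUND method on an accessor runs for WITH-ACCESSORS, but WITH-SLOTS (which goes through SLOT-VALUE) bypasses it. ACCOUNT, BALANCE and *READS* are hypothetical names:

```lisp
(defvar *reads* 0
  "Counts reads that go through the BALANCE accessor.")

(defclass account ()
  ((balance :initarg :balance :accessor balance)))

(defmethod balance :around ((a account))
  ;; Auxiliary method: runs for accessor calls, not for SLOT-VALUE.
  (incf *reads*)
  (call-next-method))

(defun read-via-accessor (a)
  (with-accessors ((b balance)) a
    b))

(defun read-via-slot-value (a)
  (with-slots (balance) a
    balance))
```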

Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/

From: Pascal Costanza
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 15:28:07 +0000
Message-ID: <6mm507Fg923gU2@mid.individual.net>

Tamas K Papp wrote:

> So I came up with a macro, which is really a trivial modification for 
> with-slots for cases when I need only need to read slots:
> 
> (defmacro with-slots* (slots instance &body body)
>   "Macro similar to with-slots, but with the slots made read-only."
>   (let ((in (gensym)))
>     `(let ((,in ,instance))
>        (declare (ignorable ,in))
>        (let ,(mapcar (lambda (slot-entry)
> 		       (let ((var-name
> 			      (if (symbolp slot-entry)
> 				  slot-entry
> 				  (car slot-entry)))
> 			     (slot-name
> 			      (if (symbolp slot-entry)
> 				  slot-entry
> 				  (cadr slot-entry))))
> 			 `(,var-name
> 			   (slot-value ,in ',slot-name))))
> 		     slots)
> 	 ,@body))))
> 
> Replacing with-slots with with-slots* in places where it makes sense lead 
> to a 40% improvement in the speed of my calculations.

This macro has the disadvantage that you won't see potential side 
effects on the slots anymore (and that's where your speedup comes from). 
If you know that your code is (mostly) functional, then that's ok.
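
To make the trade-off concrete: when a function called in the body modifies the slot, WITH-SLOTS sees the new value while WITH-SLOTS* keeps its snapshot. A sketch with a tidied copy of the macro above; RESET-S and the two demo functions are hypothetical:

```lisp
(defmacro with-slots* (slots instance &body body)
  "Like WITH-SLOTS, but binds each slot's current value once (read-only)."
  (let ((in (gensym "INSTANCE")))
    `(let ((,in ,instance))
       (declare (ignorable ,in))
       (let ,(mapcar (lambda (entry)
                       (let ((var (if (symbolp entry) entry (first entry)))
                             (slot (if (symbolp entry) entry (second entry))))
                         `(,var (slot-value ,in ',slot))))
                     slots)
         ,@body))))

(defclass myclass ()
  ((s :initform 12d0)))

(defun reset-s (instance)
  ;; A call with a side effect on the instance.
  (setf (slot-value instance 's) 0d0))

(defun before-and-after/with-slots (instance)
  (with-slots (s) instance
    (let ((before s))
      (reset-s instance)
      (list before s))))   ; second read of S sees the new value

(defun before-and-after/with-slots* (instance)
  (with-slots* (s) instance
    (let ((before s))
      (reset-s instance)
      (list before s))))   ; S is a snapshot, so the change is invisible
```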

But also keep my comment about low-level slot-value vs high-level 
accessors in mind.


Pascal


From: Alex Mizrahi
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 15:43:44 +0000
Message-ID: <4905e1b1$0$90276$14726298@news.sunsite.dk>

 TKP> I have found classes very useful for organizing complex computations,
 TKP> and the with-slots macro is quite convenient for accessing slots.
 TKP> Recently I started profiling, and realized that it is also slowing
 TKP> things down: sometimes I only need to read slots, but there is still a
 TKP> function call in there.

is it surprising?

theoretically a "sufficiently smart compiler" could detect that you're only
reading slots and use local variables, avoiding expensive object slot lookups.

however, in the presence of multiprocessing that would have different
semantics, as a slot could be changed by another thread. so this becomes a
tradeoff -- either clean and transparent semantics or aggressive optimization
that could cause some wtf moments.

top speed was never a top priority of CL, so the choice to have transparent
semantics is understandable (and that is also less work to do :))

on the other hand, i can't say that a 2x slowdown of slot access is too big
an overhead..

From: Tamas K Papp
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 15:56:09 +0000
Message-ID: <6mm6kpFhcrcdU3@mid.individual.net>

On Mon, 27 Oct 2008 17:43:44 +0200, Alex Mizrahi wrote:

>  TKP> I have found classes very useful for organizing complex computations,
>  TKP> and the with-slots macro is quite convenient for accessing slots.
>  TKP> Recently I started profiling, and realized that it is also slowing
>  TKP> things down: sometimes I only need to read slots, but there is still a
>  TKP> function call in there.
> 
> is it surprising?

No, not really, once I looked under the hood.

> top speed was never a top priority of CL, so choice to have transparent
> semantics is understandable (and that is also less work to do :))

Still, CL is the fastest practical high-level language* I know of.  I 
found that with a bit of profiling and tweaking, CL programs can be 
superfast.

Tamas

*Yes, I know, languages don't have speed, only implementations do.  You 
know what I mean.

From: Pascal Costanza
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 16:08:21 +0000
Message-ID: <6mm7blFhglp8U1@mid.individual.net>

Alex Mizrahi wrote:
>  TKP> I have found classes very useful for organizing complex computations,
>  TKP> and the with-slots macro is quite convenient for accessing slots.
>  TKP> Recently I started profiling, and realized that it is also slowing
>  TKP> things down: sometimes I only need to read slots, but there is still a
>  TKP> function call in there.
> 
> is it surprising?
> 
>  theoretically "sufficiently smart compiler" could detect that you're only
> reading slots and use local variables, avoiding expensive object slot lookups.
> 
> however, in presence of multiprocessing that would have different semantics,
> as slot could be changed by other thread. so this becomes a tradeoff --
> either clean and transparent semantics or aggressive optimization that could
> cause some wtf moments.

You don't have to go as far as adding multithreading here. A (direct or 
indirect) call to another function could already have a side effect on 
the object in question.


Pascal


From: Alex Mizrahi
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 16:21:09 +0000
Message-ID: <4905ea76$0$90270$14726298@news.sunsite.dk>

 ??>> however, in presence of multiprocessing that would have different
 ??>> semantics, as slot could be changed by other thread. so this becomes a
 ??>> tradeoff --  either clean and transparent semantics or aggressive
 ??>> optimization that could cause some wtf moments.

 PC> You don't have to go as far as adding multithreading here. A (direct or
 PC> indirect) call to another function could already have a side effect on
 PC> the object in question.

couldn't "sufficiently smart compiler" detect that in such code snippet:

(dotimes (i n)
 (incf sum (+ s (random 5d0))))

object instance is not going to be changed?

From: Dimiter "malkia" Stanev
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 19:10:41 +0000
Message-ID: <ge53nj$64v$1@registered.motzarella.org>

Alex Mizrahi wrote:
>  ??>> however, in presence of multiprocessing that would have different
>  ??>> semantics, as slot could be changed by other thread. so this becomes a
>  ??>> tradeoff --  either clean and transparent semantics or aggressive
>  ??>> optimization that could cause some wtf moments.
> 
>  PC> You don't have to go as far as adding multithreading here. A (direct or
>  PC> indirect) call to another function could already have a side effect on
>  PC> the object in question.
> 
> couldn't "sufficiently smart compiler" detect that in such code snippet:
> 
> (dotimes (i n)
>  (incf sum (+ s (random 5d0))))
> 
> object instance is not going to be changed? 
> 
> 

Unless the compiler somehow knows that the "random" function won't 
change anything, it has to assume that it might. But what if random stores its 
seed value in a slot of a global object? How would the compiler know that's not 
the object the slot "s" is coming from?

I guess there is some relation here to the open-coding - e.g. the 
compiler (internally) knows more about the symbols in the "CL" package, 
and in this case it knows that although random has state, that state 
cannot be accessed normally through the object that has the slot "s".

From: Pascal Costanza
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 16:30:28 +0000
Message-ID: <6mm8l4Fhl34nU1@mid.individual.net>

Alex Mizrahi wrote:
>  ??>> however, in presence of multiprocessing that would have different
>  ??>> semantics, as slot could be changed by other thread. so this becomes a
>  ??>> tradeoff --  either clean and transparent semantics or aggressive
>  ??>> optimization that could cause some wtf moments.
> 
>  PC> You don't have to go as far as adding multithreading here. A (direct or
>  PC> indirect) call to another function could already have a side effect on
>  PC> the object in question.
> 
> couldn't "sufficiently smart compiler" detect that in such code snippet:
> 
> (dotimes (i n)
>  (incf sum (+ s (random 5d0))))
> 
> object instance is not going to be changed? 

Maybe. But if the efficiency in this case matters that much, you will 
probably also notice this in your profiler, and might as well optimize 
this by hand. ;)


Pascal


From: Nikodemus Siivola
Subject: Re: with-slots slow
Date: Mon, 27 Oct 2008 17:38:59 +0000
Message-ID: <ge4ubj$efg$1@nyytiset.pp.htv.fi>

Tamas K Papp wrote:

> (defun foo1 (instance n)
>   "Using with-slots."
>   (declare (fixnum n))
>   (with-slots (s) instance
>     (let ((sum 0d0))
>       (declare (double-float sum s))
>       (dotimes (i n)
> 	(incf sum (+ s (random 5d0))))
>       sum)))
> 
> (defun foo2 (instance n)
>   "Using let."
>   (declare (fixnum n))
>   (let ((sum 0d0)
> 	(s (slot-value instance 's)))
>     (declare (double-float sum s))
>     (dotimes (i n)
>       (incf sum (+ s (random 5d0))))
>     sum))

FOO1 and FOO2 are not doing quite the same thing: in FOO2 you have performed 
manual loop-invariant optimization by lifting out reading the slot from the loop.

Now, granted, SBCL does not currently do loop code motion at all -- but even 
if it did, the compiler could not do it in this case: calling SLOT-VALUE only 
once would be an illegal program transformation. If INSTANCE was known to be a 
structure instance, then the transformation would be legal.

In case of a CLOS instance... there are a number of things to worry about: (1) 
SLOT-VALUE-USING-CLASS, (2) slot non-existence and methods on SLOT-MISSING, (3) 
slot unboundness and methods on SLOT-UNBOUND. There may be more, but these 
at least come to mind. The point is that given these things the program can 
behave differently with and without the transformation -- hence it cannot be 
done. (That said, I do think that writing stuff that would be affected by a 
transformation like that sounds highly bogus, and if the spec was a work in 
progress I would probably vote in favor of explicitly allowing things like this.)
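
Item (3) can be made observable with a SLOT-UNBOUND method that returns a default without storing it. A sketch, assuming a hypothetical LAZY-THING class whose slot S is never bound (LAZY-THING, *UNBOUND-CALLS* and the two sum functions are illustrative names): reading inside the loop invokes the method on every iteration, while the hoisted version invokes it once, so the two really do behave differently:

```lisp
(defclass lazy-thing ()
  (s))                                  ; no initform: S starts unbound

(defvar *unbound-calls* 0)

(defmethod slot-unbound (class (instance lazy-thing) (slot-name (eql 's)))
  ;; Supply a default without storing it, and count how often we are asked.
  (declare (ignore class))
  (incf *unbound-calls*)
  1d0)

(defun sum/with-slots (instance n)
  ;; Reads the slot on every iteration, so SLOT-UNBOUND runs N times.
  (with-slots (s) instance
    (let ((sum 0d0))
      (dotimes (i n sum)
        (incf sum s)))))

(defun sum/hoisted (instance n)
  ;; Manual loop-invariant motion: one read, so SLOT-UNBOUND runs once.
  (let ((s (slot-value instance 's))
        (sum 0d0))
    (dotimes (i n sum)
      (incf sum s))))
```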

Incidentally, you might want to have a look at 
http://www.sbcl.org/manual/Slot-access.html#Slot-access, which gives some 
hints on how to optimize slot access speed. Specifically,

(defmethod foo3 ((instance myclass) n)
  "Using with-slots in defmethod specialized on the instance."
  (declare (fixnum n))
  (with-slots (s) instance
    (let ((sum 0d0))
      (declare (double-float sum))
      (dotimes (i n)
        (incf sum (+ (the double-float s) (random 5d0))))
      sum)))

is much closer to FOO2 than to FOO1, since the slot access becomes just two 
memory indirections (well, approximately) thanks to permutation vector 
optimizations.

(Using THE instead of DECLARE is a workaround for a bug in those selfsame 
permutation vector optimizations -- the type information from the declaration 
was lost during the optimization. Patch on sbcl-devel, going in CVS in a week 
or so, after the 1.0.22 release.)

Cheers,

  -- Nikodemus