From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <r_mdnXCQIsVnKyXVnZ2dnUVZ_oHinZ2d@speakeasy.net>
Andreas Davour  <·······@updateLIKE.uu.HELLse> wrote:
+---------------
| I've just spent three days in a application tuning seminar, learning
| everything about cache lines, cache hits, inlining functions, unrolling
| loops, openMP and more of that kind of techniques. ...
+---------------

The only thing I can contribute to that is:

1. Compiled CLs are not always careful about placing functions
   on optimal cache line boundaries.

2. Item #1 can make a *significant* difference in runtime.

For example, in CMUCL, consider these successive trials:

    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.45f0 seconds of real time
    ;   4.374417f0 seconds of user run time
    ;   4.04f-4 seconds of system run time
    ;   8,243,655,669 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.42f0 seconds of real time
    ;   4.373149f0 seconds of user run time
    ;   5.6f-5 seconds of system run time
    ;   8,200,990,023 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   2.95f0 seconds of real time
    ;   2.919542f0 seconds of user run time
    ;   0.002418f0 seconds of system run time
    ;   5,459,768,540 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   4.5f0 seconds of real time
    ;   4.365597f0 seconds of user run time
    ;   0.007846f0 seconds of system run time
    ;   8,362,612,228 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu>

Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.
Why? Because the third compilation was more optimally placed for
the CPU's branch prediction (Athlon-32, as it happens).

I know of no way to tell CMUCL how to align generated code.
YMMV with other implementations.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Don Geddis
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <878wufuqh0.fsf@geddis.org>
····@rpw3.org (Rob Warnock) wrote on Fri, 29 Aug 2008:
> For example, in CMUCL, consider these successive trials:
>     cmu> (time (dotimes (i 2000000000)))
> Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.

Shouldn't you first compile the code that you want to time?

> Why? Because the third compilation was more optimally placed for
> the CPU's branch prediction (Athlon-32, as it happens).

If you're running interpreted code, how can you be so certain that this
is the correct explanation for the difference in run times?

        -- Don
_______________________________________________________________________________
Don Geddis                  http://don.geddis.org/               ···@geddis.org
I think there should be something in science called the "reindeer effect."  I
don't know what it would be, but I think it'd be good to hear someone say,
"Gentlemen, what we have here is a terrifying example of the reindeer effect."
	-- Deep Thoughts by Jack Handey [SNL]
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <YtidndpTnYwnrSTVnZ2dnUVZ_g2dnZ2d@speakeasy.net>
Don Geddis  <···@geddis.org> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) wrote on Fri, 29 Aug 2008:
| > For example, in CMUCL, consider these successive trials:
| >     cmu> (time (dotimes (i 2000000000)))
| > Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.
| 
| Shouldn't you first compile the code that you want to time?
+---------------

Well, what I wonder is why you failed to quote the lines from my post
that showed that each run *WAS* in fact being compiled [which CMUCL does
by default for each call to TIME]?!?!?

    cmu> (time (dotimes (i 2000000000)))
    ; Compiling LAMBDA NIL:                  <=== THESE TWO LINES
    ; Compiling Top-Level Form:              <=== THESE TWO LINES

    ; Evaluation took:
    ;   2.73f0 seconds of real time
    ;   2.720226f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   6,045,046,240 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu>

+---------------
| > Why? Because the third compilation was more optimally placed for
| > the CPU's branch prediction (Athlon-32, as it happens).
| 
| If you're running interpreted code, how can you be so certain that
| this is the correct explanation for the difference in run times?
+---------------

1. Because it's *NOT* "interpreted code" (see above).

2. Because I spent quite a bit of time one afternoon examining this
   situation at length. [See my parallel reply to "verec" for details.]


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: verec
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <48b8f8d0$0$627$5a6aecb4@news.aaisp.net.uk>
On 2008-08-30 03:56:58 +0100, ····@rpw3.org (Rob Warnock) said:

> Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.
> Why? Because the third compilation was more optimally placed for
> the CPU's branch prediction (Athlon-32, as it happens).

Don't you think your example would be more convincing if you were
to show us a) that the memory address at which each instance of your
code has been compiled is indeed cache friendly, b) that cache
| friendliness is preserved across VM mapping, and c) that no other
process on your system was stealing time away, 3 runs out of 4 ?
--
JFB
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <itadnRbJmcf6GCTVnZ2dnUVZ_uednZ2d@speakeasy.net>
verec  <·····@mac.com> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) said:
| > Note that the same loop took ~4, ~4, ~2, & ~4 CPU cycles/iteration.
| > Why? Because the third compilation was more optimally placed for
| > the CPU's branch prediction (Athlon-32, as it happens).
| 
| Don't you think your example would be more convincing...
+---------------

Well, I really wasn't trying to "convince" anyone, just suggest
to Andreas Davour that alignment of compiled code was something
he might want to look at. But if you insist...  ;-}  ;-}

NOTE: For the following examples, I'll still be using 32-bit CMUCL-19c
as before, but since my laptop is packed up at the moment what follows
will be done on my desktop machine's Athlon-64, not my laptop's Athlon-32
[which is what I used yesterday]. The main difference is that on the
Athlon-64 the "good" & "bad" cycle counts per iteration are "2" & "3"
(though still in 32-bit mode). [Plus, it's ~1.6 times as fast as the
laptop.] But the general effect of alignment on performance is similar.

+---------------
| ...if you were to show us a) that the memory address at which each
| instance of your code has been compiled is indeed cache friendly, ...
+---------------

Well, not every instance was cache friendly; that was the whole point!

But if you want examples of alignments that are reproducibly fast versus
reproducibly slow, then grab yourself a copy of CMUCL and follow along
at home as I reproduce this. [I'm not going to copy/paste the entire
transcript, only key bits.]

We start by compiling (say) 20 copies of a canonical DOTIMES test
function. I use 2000000000 as the count since that's close to the
maximum that CMUCL will force into native 32-bit integer arithmetic
(when compiled) without explicit declarations.
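
As a quick sanity check on that count (a sketch, nothing CMUCL-specific):

```lisp
;; 2000000000 fits in 31 bits, i.e. within (UNSIGNED-BYTE 32),
;; so the compiled loop counter can live in a 32-bit register.
(integer-length 2000000000)             ; => 31
(typep 2000000000 '(unsigned-byte 32))  ; => T
```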

    cmu> (defparameter *funcs* (coerce
				(loop repeat 20 collect
				  (compile nil (lambda ()
						 (dotimes (i 2000000000)))))
				'vector))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 
    ...[and so on]...

    *FUNCS*
    cmu>

The key inner loop here is only 4 instructions (11 bytes), labeled below
with offsets 0x51-0x5A [note that the "JB" is a 2-byte instruction]:

    cmu> (disassemble (aref *funcs* 0))
    48B68C28:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
    ...[rest of prolog]...
      4A:       MOV     EAX, 0               ; No-arg-parsing entry point
      4F:       JMP     L1
    ;;; [2] (DOTIMES (I 2000000000))
      51: L0:   INC     EAX                  ; [:BLOCK-START]
      52: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
      54:       CMP     ECX, 2000000000
      5A:       JB      L0
    ...[epilog]...
    cmu> (time (funcall (aref *funcs* 0)))
    ; Compiling LAMBDA NIL: 
    ; Compiling Top-Level Form: 

    ; Evaluation took:
    ;   1.85f0 seconds of real time
    ;   1.813294f0 seconds of user run time
    ;   0.0f0 seconds of system run time
    ;   4,103,287,566 CPU cycles
    ;   0 page faults and
    ;   0 bytes consed.
    ; 
    NIL
    cmu> 

The alignment above is one of the "good" ones that for an Athlon gives
only 2 CPU clocks/iteration (plus a couple of percent of total overhead).
"Bad" alignments give 3 CPU clocks/iteration, as we'll see shortly.

Now TIME all 20 of the functions [you can safely ignore the warnings
you'll get about the "TIME form in a non-null environment, forced to
interpret." That only means that the FUNCALL and the AREF are interpreted,
not the compiled LAMBDAs in *FUNCS* themselves]. From the copious output
that results I've extracted just the "CPU cycles" lines:

    cmu> (dotimes (i 20)
	   (time (funcall (aref *funcs* i))))
    ...
    ;   4,034,516,892 CPU cycles
    ;   6,065,078,446 CPU cycles
    ;   6,058,044,433 CPU cycles
    ;   4,066,357,764 CPU cycles
    ;   4,055,854,124 CPU cycles
    ;   4,043,935,574 CPU cycles
    ;   6,040,178,270 CPU cycles
    ;   4,029,870,642 CPU cycles
    ;   6,065,532,242 CPU cycles
    ;   6,037,903,830 CPU cycles
    ;   6,113,910,026 CPU cycles
    ;   6,066,510,660 CPU cycles
    ;   6,040,727,000 CPU cycles
    ;   4,028,146,010 CPU cycles
    ;   6,046,813,682 CPU cycles
    ;   6,062,487,970 CPU cycles
    ;   4,038,351,352 CPU cycles
    ;   6,035,721,860 CPU cycles
    ;   6,039,610,976 CPU cycles
    ;   6,065,662,764 CPU cycles
    ...
    cmu> 

Since we're looping two billion times, that says that (as expected)
a given copy of the function takes either two or three CPU cycles
per iteration of the inner loop. If you put those lines into an array
(call it *CYCLE-LINES*) and pick out the leading digits you get:

    cmu> (loop repeat 20 for i from 4 by 33
	   collect (parse-integer (subseq *cycle-lines* i (1+ i))))

    (4 6 6 4 4 4 6 4 6 6 6 6 6 4 6 6 4 6 6 6)
    cmu> (defparameter *cycles* *)

    *CYCLES*
    cmu> (count 4 *cycles*)

    7
    cmu> 

(Aside: If CMUCL compiled code alignment were uniformly random, this would
 suggest that roughly 1/3 of the alignments are "good" [2 CPU cycles
 per iteration] and the other 2/3 are "bad" [3 CPU cycles/iteration].
 But in fact [skipping ahead a bit] it's really more like a 50/50 split,
 and the above is skewed only due to an overly-small sample size.)

To help look at the alignments in detail, let's collect all the
disassemblies:

    cmu> (defparameter *dis-lines*
	   (coerce
	    (let ((dis (with-output-to-string (*standard-output*)
			 (dotimes (i 20)
			   (disassemble (aref *funcs* i))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))

    *DIS-LINES*
    cmu> (length *dis-lines*)

    640
    cmu>

So each one is 32 lines long. [We'll need that later.]

    cmu> (loop for line across *dis-lines*
	   when (search ".ENTRY" line )
	     collect line)

    ("48B68C28:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4800B620:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480244B0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4803AD98:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480519D8:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4806A4B8:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48080E80:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48097838:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480AE360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480C5130:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480DBA20:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "480F2360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48108CE0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4811F668:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48136360:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4814CDB0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48163838:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "4817A600:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "48191290:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)"
     "481A7EA0:       .ENTRY \"LAMBDA NIL\"()        ; (FUNCTION NIL NULL)")
    cmu> 

CMUCL-19c always allocates objects on 8-byte boundaries, as you can see.
The cache line size of the Athlon is 64 bytes, so there should be only 
8 cases we need to distinguish. Let's match them up against "fast" and
"slow" execution:

    cmu> (loop for line in *
	   collect (mod (parse-integer (subseq line 6 8) :radix 16)
			64))

    (40 32 48 24 24 56 0 56 32 48 32 32 32 40 32 48 56 0 16 32)
    cmu> (defparameter *offsets* *)

    *OFFSETS*
    cmu> (loop for cycles in *cycles*
	       and offset in *offsets*
	   collect (list cycles offset))

    ((4 40) (6 32) (6 48) (4 24) (4 24) (4 56) (6 0) (4 56) (6 32) (6 48)
     (6 32) (6 32) (6 32) (4 40) (6 32) (6 48) (4 56) (6 0) (6 16) (6 32))
    cmu> (stable-sort (copy-list *) #'< :key #'first)

    ((4 40) (4 24) (4 24) (4 56) (4 56) (4 40) (4 56) (6 32) (6 48) (6 0)
     (6 32) (6 48) (6 32) (6 32) (6 32) (6 32) (6 48) (6 0) (6 16) (6 32))
    cmu> (stable-sort (copy-list *) #'< :key #'second)

    ((6 0) (6 0) (6 16) (4 24) (4 24) (6 32) (6 32) (6 32) (6 32) (6 32)
     (6 32) (6 32) (4 40) (4 40) (6 48) (6 48) (6 48) (4 56) (4 56) (4 56))
    cmu> 

This is very interesting: When the offset of the start of the function
is a multiple of 16, the code runs slow; when the offset is 8 *plus*
a multiple of 16, the code runs fast.
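
That rule is simple enough to state as a predicate. A minimal sketch
(fed the observed *OFFSETS* values above; nothing implementation-specific):

```lisp
;; Sketch of the rule above: an 8-byte-aligned function start is
;; "fast" exactly when its offset is 8 plus a multiple of 16.
(defun predicted-speed (entry-offset)
  (if (= 8 (mod entry-offset 16)) :fast :slow))

;; Applied to the first four offsets above:
(mapcar #'predicted-speed '(40 32 48 24))  ; => (:FAST :SLOW :SLOW :FAST)
```

which matches the measured (4 40) (6 32) (6 48) (4 24) pairs.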

Is this reproducible? I won't show all my experiments here, but the
answer is a definite "Yes!!". If you repeat the timing tests -- *without*
doing a GC!!! -- the times stay the same, and the correspondences between
time & function index stay the same:

    cmu> (defparameter *time-lines*
	   (coerce
	    (let ((dis (with-output-to-string (*trace-output*)
			 (dotimes (i 20)
			   (time (funcall (aref *funcs* i)))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))
    ...[lots of warnings about "TIME form in a non-null environment"]...
    *TIME-LINES*
    cmu> (loop for line across *time-lines*
	   when (search "CPU cycles" line)
	     collect (parse-integer line :start 4 :end 5))

    (4 6 6 4 4 4 6 4 6 6 6 6 6 4 6 6 4 6 6 6)
    cmu> (equal * *cycles*)

    T
    cmu> 

So functions which were fast just after being compiled stay fast
upon repeated execution, and functions which were slow stay slow.

From the disassembly above, we know that the code of the inner loop of
the DOTIMES is 11 bytes long. There are only two possible alignments
of interest, "16*n" and "(16*n)+8", but why is the "(16*n)+8" alignment
the *fast* one?!?

For that we have to look at the alignment of the inner loop code itself,
*not* the alignments of the start of the functions. The inner loop code
is at disassembly lines 13, 15, 16, & 17 [from doing a full disassembly
and manually counting, not any predetermined magic], so you can examine
all of them this way [I'll only show the first few]:

    cmu> (loop for j from 0 by 32 below (length *dis-lines*)
	       and time in *cycles*
	   do (format t "===== ~s =====~%" (if (= time 4) "FAST" "SLOW"))
	      (loop for i in '(0 13 15 16 17)
		do (write-line (aref *dis-lines* (+ i j)))))
    ===== "FAST" =====
    48B68C28:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  51: L0:   INC     EAX                  ; [:BLOCK-START]
	  52: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  54:       CMP     ECX, 2000000000
	  5A:       JB      L0
    ===== "SLOW" =====
    4800B620:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  49: L0:   INC     EAX                  ; [:BLOCK-START]
	  4A: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  4C:       CMP     ECX, 2000000000
	  52:       JB      L0
    ===== "SLOW" =====
    480244B0:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  D9: L0:   INC     EAX                  ; [:BLOCK-START]
	  DA: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  DC:       CMP     ECX, 2000000000
	  E2:       JB      L0
    ===== "FAST" =====
    4803AD98:       .ENTRY "LAMBDA NIL"()        ; (FUNCTION NIL NULL)
	  C1: L0:   INC     EAX                  ; [:BLOCK-START]
	  C2: L1:   MOV     ECX, EAX             ; [:BLOCK-START]
	  C4:       CMP     ECX, 2000000000
	  CA:       JB      L0
    ...[trimmed]...
    cmu> 

So for the "fast" cases the inner loop code never crosses a 16-byte
boundary, and for the "slow" cases it does. But why is a 16-byte
boundary important, and not the 64-byte cache line size? A possible
answer for that I dug out of an old copy of Microprocessor Report
on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
[Yes, I know I'm running on a K8, but I'm guessing it's the same.]
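
If the 16-byte fetch window really is the mechanism, the fast/slow split
should fall out of whether the 11-byte loop straddles a window boundary.
A minimal sketch, fed the loop-start offsets from the disassemblies above:

```lisp
;; Does an 11-byte inner loop starting at LOOP-START straddle a
;; 16-byte fetch-window boundary?
(defun crosses-fetch-window-p (loop-start &optional (loop-bytes 11))
  (/= (floor loop-start 16)
      (floor (+ loop-start loop-bytes -1) 16)))

;; The FAST, SLOW, SLOW, FAST cases shown above:
(mapcar #'crosses-fetch-window-p '(#x51 #x49 #xD9 #xC1))
;; => (NIL T T NIL)
```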

As further confirmation, if one now forces a GC [CMUCL has a copying GC],
then all those functions will get moved around, and the alignments will
all change, and with them the run-times under TIME. But what does *not*
change is the pairing between the new function alignments and which
functions now run "fast" or "slow":

    cmu> (gc :full t)

    ; [GC threshold exceeded with 6,902,768 bytes in use.  Commencing GC.]
    ; [GC completed with 1,441,256 bytes retained and 5,461,512 bytes freed.]
    ; [GC will next occur when at least 13,441,256 bytes are in use.]

    NIL
    cmu> (gc :full t)
    ; [GC threshold exceeded with 1,445,288 bytes in use.  Commencing GC.]
    ; [GC completed with 1,437,264 bytes retained and 8,024 bytes freed.]
    ; [GC will next occur when at least 13,437,264 bytes are in use.]

    NIL
    cmu> (defparameter *dis-lines/2*
	   (coerce
	    (let ((dis (with-output-to-string (*standard-output*)
			 (dotimes (i 20)
			   (disassemble (aref *funcs* i))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))

    *DIS-LINES/2*
    cmu> (defparameter *offsets/2*
	   (loop for line across *dis-lines/2*
	     when (search ".ENTRY" line )
	      collect (mod (parse-integer (subseq line 6 8) :radix 16)
			   64)))

    *OFFSETS/2*
    cmu> *offsets/2*

    (16 8 0 40 32 24 16 8 24 0 56 48 40 32 24 16 8 0 56 48)
    cmu> (defparameter *time-lines/2*
	   (coerce
	    (let ((dis (with-output-to-string (*trace-output*)
			 (dotimes (i 20)
			   (time (funcall (aref *funcs* i)))))))
	      (with-input-from-string (s dis)
		(loop for line = (read-line s nil nil)
		      while line
		  collect line)))
	    'vector))
    ...[lots of warnings about "TIME form in a non-null environment"]...
    *TIME-LINES/2*
    cmu> (defparameter *cycles/2*
		       (loop for line across *time-lines/2*
			 when (search "CPU cycles" line)
			   collect (parse-integer line :start 4 :end 5)))

    *CYCLES/2*
    cmu> *cycles/2*

    (6 4 6 4 6 4 6 4 4 6 4 6 4 6 4 6 4 6 4 6)
    cmu> (loop for cycles in *cycles/2*
	       and offset in *offsets/2*
	   collect (list cycles offset))

    ((6 16) (4 8) (6 0) (4 40) (6 32) (4 24) (6 16) (4 8) (4 24) (6 0)
     (4 56) (6 48) (4 40) (6 32) (4 24) (6 16) (4 8) (6 0) (4 56) (6 48))
    cmu> (stable-sort (copy-list *) #'< :key #'first)

    ((4 8) (4 40) (4 24) (4 8) (4 24) (4 56) (4 40) (4 24) (4 8) (4 56)
     (6 16) (6 0) (6 32) (6 16) (6 0) (6 48) (6 32) (6 16) (6 0) (6 48))
    cmu> (stable-sort (copy-list *) #'< :key #'second)

    ((6 0) (6 0) (6 0) (4 8) (4 8) (4 8) (6 16) (6 16) (6 16) (4 24) (4 24)
     (4 24) (6 32) (6 32) (4 40) (4 40) (6 48) (6 48) (4 56) (4 56))
    cmu> 

Here again, function starting offsets of "16*n" are "slow" and offsets
of "(16*n)+8" are "fast", even though the "same" functions have different
offsets/speeds than they did before the relocation (due to the GC).

You can repeat this sequence all day long, and you'll get the same
correlation of speed with alignment, even though for any single function
*its* alignment (and thus speed) may well vary from one GC to the next.

This makes benchmarking Lisp programs with copying garbage collectors
very "interesting", to say the least.  ;-}  ;-}

+---------------
| b) that cache friendliness is preserved across VM mapping...
+---------------

I'm not sure exactly what you're trying to ask here, but consider
that the inner loop in question is only 11 bytes long, so it either
all fits in a single primary I-cache fetch or it's broken between
exactly two such fetches [given that 32-bit CMUCL always aligns
objects on 8-byte boundaries]. Anyway, since pages consist of an
even number of cache lines, I don't think the "VM mapping" (whatever
you mean by that) matters very much in this particular case.

+---------------
| and c) that no other process on your system was stealing time away,
| 3 runs out of 4 ?
+---------------

There's no way to convince you of that remotely, of course, but I
do know what's running on my systems, and no, there were no other
significant CPU loads during the above TIME runs. By "significant"
I mean loads that would skew the first two decimal digits of the
CPU's cycle count [which CMUCL reads with the x86 "RDTSC" instruction].


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Bakul Shah
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <48B9D8C1.9030008@bitblocks.com>
Rob Warnock wrote:
> CMUCL-19c always allocates objects on 8-byte boundaries, as you can see.
> The cache line size of the Athlon is 64 bytes, so there should be only 
> 8 cases we need to distinguish. Let's match them up against "fast" and
> "slow" execution:
		...
> This is very interesting: When the offset of the start of the function
> is a multiple of 16, the code runs slow; when the offset is 8 *plus*
> a multiple of 16, the code runs fast.
		...
> So functions which were fast just after being compiled stay fast
> upon repeated execution, and functions which were slow stay slow.
> 
> From the disassembly above, we know that the code of the inner loop of
> the DOTIMES is 11 bytes long. There are only two possible alignments
> of interest, "16*n" and "(16*n)+8", but why is the "(16*n)+8" alignment
> the *fast* one?!?
		...
> So for the "fast" cases the inner loop code never crosses a 16-byte
> boundary, and for the "slow" cases it does. But why is a 16-byte
> boundary important, and not the 64-byte cache line size? A possible
> answer for that I dug out of an old copy of Microprocessor Report
> on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
> instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
> [Yes, I know I'm running on a K8, but I'm guessing it's the same.]
> 
> As further confirmation, if one now forces a GC [CMUCL has a copying GC],
> then all those functions will get moved around, and the alignments will
> all change, and with them the run-times under TIME. But what does *not*
> change is the pairing between the new function alignments and which
> functions now run "fast" or "slow":
	...
> Here again, function starting offsets of "16*n" are "slow" and offsets
> of "(16*n)+8" are "fast", even though the "same" functions have different
> offsets/speeds than they did before the relocation (due to the GC).
> 
> You can repeat this sequence all day long, and you'll get the same
> correlation of speed with alignment, even though for any single function
> *its* alignment (and thus speed) may well vary from one GC to the next.
> 
> This makes benchmarking Lisp programs with copying garbage collectors
> very "interesting", to say the least.  ;-}  ;-}

Looks like you are saying
a) loops should be aligned on a 16 byte boundary
b) A copying GC should maintain function alignment modulo 16
That shouldn't be too hard to fix, right?!
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <P9KdnRf8PsH2hSfVnZ2dnUVZ_vOdnZ2d@speakeasy.net>
Bakul Shah  <············@bitblocks.com> wrote:
+---------------
| Rob Warnock wrote:
| > CMUCL-19c always allocates objects on 8-byte boundaries, as you can see.
...
| > So for the "fast" cases the inner loop code never crosses a 16-byte
| > boundary, and for the "slow" cases it does. But why is a 16-byte
| > boundary important, and not the 64-byte cache line size? A possible
| > answer for that I dug out of an old copy of Microprocessor Report
| > on the AMD K7, which notes that the K7 pipeline fetches 16 bytes of
| > instructions from the primary I-cache per cycle. *AHA!* All becomes clear!
...
| > ...for any single function
| > *its* alignment (and thus speed) may well vary from one GC to the next.
| 
| Looks like you are saying
| a) loops should be aligned on a 16 byte boundary
| b) A copying GC should maintain function alignment modulo 16
+---------------

Exactly.

+---------------
| That shouldn't be too hard to fix, right?!
+---------------

Well... In theory, yes, but "theory vs practice", you know.  ;-}

You don't want to have to force conses to 16-byte boundaries,
since on a 32-bit machine that would *double* the space taken
by them [though on 64-bit it's a no-brainer], so you'd have
to do some sort of BiBoP-like (or at least segregated) allocation
in the GC. And then there's adding algorithms to the compiler
to know *which* loops are worth forcing alignment on (inner
two levels only, probably), and then adding data structures to
the compiler to keep track of that where it's needed, and to
glue that information together with the part of the compiler
that stitches generated code together [since the current CMUCL
compiler is mostly template based, viz. "VOPs"]. And then modify
the FASL loader & format to preserve that alignment on loaded
compiled code. Etc.

Certainly "not too hard" for a funded project by a small team
of experienced compiler writers, but I suspect that resource
isn't currently available for CMUCL.

Perhaps someone from one of the commercial implementations
could comment on how they address(ed) this issue... [Duane?]


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Bakul Shah
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <48BA2911.4090206@bitblocks.com>
Rob Warnock wrote:
> Bakul Shah  <············@bitblocks.com> wrote:
> | a) loops should be aligned on a 16 byte boundary
> | b) A copying GC should maintain function alignment modulo 16
> +---------------
> 
> Exactly.
> 
> +---------------
> | That shouldn't be too hard to fix, right?!
> +---------------
> 
> Well... In theory, yes, but "theory vs practice", you know.  ;-}
> 
> You don't want to have to force conses to 16-byte boundaries,
> since on a 32-bit machine that would *double* the space taken
> by them [though on 64-bit it's a no-brainer],

I am lost. What does consing have to do with loops? I am only talking
about aligning loops (as part of some compiled code).
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <s72dnZxH2JTcGifVnZ2dnUVZ_rTinZ2d@speakeasy.net>
Bakul Shah  <············@bitblocks.com> wrote:
+---------------
| Rob Warnock wrote:
| > Bakul Shah  <············@bitblocks.com> wrote:
| > You don't want to have to force conses to 16-byte boundaries,
| > since on a 32-bit machine that would *double* the space taken
| > by them [though on 64-bit it's a no-brainer],
| 
| I am lost. What does consing have to do with loops? I am only
| talking about aligning loops (as part of some compiled code).
+---------------

And I was talking about needing to *maintain* that alignment
when a copying GC copies such an aligned code block.

But as Thomas Burdick said in a parallel reply, I may have
been trying too hard to save a few bytes when I suggested
partitioning the heap. In fact, when copying a code object,
if you simply allow the GC to waste 8 bytes if it needs to
for maintaining alignment, that should be "good enough".
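
That "waste up to 8 bytes" amounts to rounding the copy destination up
to the next 16-byte boundary. As a sketch (not actual CMUCL GC code):

```lisp
;; Round ADDRESS up to the next multiple of ALIGNMENT.  With 8-byte
;; heap granularity this wastes at most 8 bytes per code object.
(defun align-up (address alignment)
  (* alignment (ceiling address alignment)))

(align-up 8 16)   ; => 16 (8 bytes wasted)
(align-up 32 16)  ; => 32 (nothing wasted)
```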


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Thomas F. Burdick
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <a52f1803-0f68-4ffd-9eba-966742edc600@k30g2000hse.googlegroups.com>
To the point you made in your earlier post, about profiling being
"interesting" with a copying GC ... of course, most of the time,
alignment issues are going to play a very minor role (it's essentially
the noise in macro-optimization); but when I have had
micro-optimization concerns, trying to get the inner loop of an algorithm to
fit into the i-cache, obviously there you do care about alignment. The
way to work around the copying GC is to have your build process result
in a dumped core, with your code in static space. Then you can look to
see if it's aligned as you need it.

On Aug 31, 6:04 am, ····@rpw3.org (Rob Warnock) wrote:
>
> Bakul Shah  <············@bitblocks.com> wrote:
> |
> | Looks like you are saying
> | a) loops should be aligned on a 16 byte boundary
> | b) A copying GC should maintain function alignment modulo 16
> +---------------
>
> Exactly.
>
> +---------------
> | That shouldn't be too hard to fix, right?!
> +---------------
>
> Well... In theory, yes, but "theory vs practice", you know.  ;-}
>
> You don't want to have to force conses to 16-byte boundaries,
> since on a 32-bit machine that would *double* the space taken
> by them [though on 64-bit it's a no-brainer], so you'd have
> to do some sort of BiBoP-like (or at least segregated) allocation
> in the GC. And then there's adding algorithms to the compiler
> to know *which* loops are worth forcing alignment on (inner
> two levels only, probably), and then adding data structures to
> the compiler to keep track of that where it's needed, and to
> glue that information together with the part of the compiler
> that stitches generated code together [since the current CMUCL
> compiler is mostly template based, viz. "VOPs"]. And then modify
> the FASL loader & format to preserve that alignment on loaded
> compiled code. Etc.

I think you're making this harder than it has to be. It would be
enough if the GC knew that code vectors need to be aligned to some
coarser granularity than other objects. If it needs to, it would waste
an extra 8 bytes when moving a code vector, but that's not a big deal
-- conses would stay unaffected, and it wouldn't be a huge, pervasive
change like partitioning the heap. If you know that code vectors are
always 16-byte aligned, it would be fairly simple to insert no-ops
to align given bits of the code to certain boundaries. No changes needed
to the FASL loader or anything. The big deal would be adding loop
analysis to the compiler so it could figure out *which* blocks are the
ones where alignment matters.

> Certainly "not too hard" for a funded project by a small team
> of experienced compiler writers, but I suspect that resource
> isn't currently available for CMUCL.

There I agree with you. Adding loop analysis to SBCL has been
threatened in the past, and if someone gets around to it, this is the
sort of thing that could quite easily come out of such an effort.
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <s72dnZ9H2JT8FSfVnZ2dnUVZ_rTinZ2d@speakeasy.net>
Thomas F. Burdick <········@gmail.com> wrote:
+---------------
| > You don't want to have to force conses to 16-byte boundaries,
| > since on a 32-bit machine that would *double* the space taken
| > by them [though on 64-bit it's a no-brainer], so you'd have
| > to do some sort of BiBoP-like (or at least segregated) allocation
| > in the GC. ...
| 
| I think you're making this harder than it has to be. It would be
| enough if the GC knew that code vectors need to be aligned to some
| coarser granularity than other objects. If it needs to, it would waste
| an extra 8 bytes when moving a code vector, but that's not a big deal
| -- conses would stay unaffected, and it wouldn't be a huge, pervasive
| change like partitioning the heap.
+---------------

*D'Oh!!*  Of course, how silly of me. You're right, of course.
Forcing alignment of just the code blocks would be "good enough".
And given the average size of a code block, the amount of wasted
space would be minuscule on average.

"Never mind..."


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Bakul Shah
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <48BADE31.2000902@bitblocks.com>
Rob Warnock wrote:
> Thomas F. Burdick <········@gmail.com> wrote:
> +---------------
> | > You don't want to have to force conses to 16-byte boundaries,
> | > since on a 32-bit machine that would *double* the space taken
> | > by them [though on 64-bit it's a no-brainer], so you'd have
> | > to do some sort of BiBoP-like (or at least segregated) allocation
> | > in the GC. ...
> | 
> | I think you're making this harder than it has to be. It would be
> | enough if the GC knew that code vectors need to be aligned to some
> | coarser granularity than other objects. If it needs to, it would waste
> | an extra 8 bytes when moving a code vector, but that's not a big deal
> | -- conses would stay unaffected, and it wouldn't be a huge, pervasive
> | change like partitioning the heap.
> +---------------
> 
> *D'Oh!!*  Of course, how silly of me. You're right, of course.
> Forcing alignment of just the code blocks would be "good enough".
> And given the average size of a code block, the amount of wasted
> space would be miniscule on average.
> 
> "Never mind..."

What I was saying originally:
| b) A copying GC should maintain function alignment modulo 16
Probably what confused you was that I (loosely) used `function'
to mean a `compiled function or subroutine'.
From: verec
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <48b9ea40$0$620$5a6aecb4@news.aaisp.net.uk>
On 2008-08-30 18:37:43 +0100, ····@rpw3.org (Rob Warnock) wrote:

[a very detailed explanation of his Lisp-on-Athlon cache alignment
experiments]

That's certainly more than I expected, but it raises the
question of how realistic such improved "cache-aligned" loops
might be in the real world:

Given the byzantine per-opcode byte counts of the x86 ISA, it
is very likely that most loops will need more than 16 bytes of
code between the branch-back and the top of the loop, thus
incurring a systematic 16-byte fetch-cycle miss each and every
time through the loop.

As for the VM mapping bit, I was just saying that virtual
address 48B68C28 (labelled FAST in your example) might be
mapped to physical address, say, xxxx20, which is what the CPU
actually fetches from, and thus your 16-byte alignment that
holds in VM land might just break after translation.

But thanks for the entertaining explanation :-)
--
JFB
From: Rob Warnock
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <WuCdnf7zmaLHgCfVnZ2dnUVZ_qTinZ2d@speakeasy.net>
verec  <·····@mac.com> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) wrote:
| [a very detailed explanation of his Lisp-on-Athlon cache alignment
| experiments]
| 
| That's certainly more than I expected, but it raises the
| question of how realistic such improved "cache-aligned" loops
| might be in the real world:
| 
| Given the byzantine per-opcode byte counts of the x86 ISA, it
| is very likely that most loops will need more than 16 bytes of
| code between the branch-back and the top of the loop, thus
| incurring a systematic 16-byte fetch-cycle miss each and every
| time through the loop.
+---------------

In my conversations with people who *are* experienced compiler
writers [I'm not, just an amateur there], I've been told that
for algorithmic kernels keeping loops 16-byte-aligned is an
easy way to get a small but measurable improvement. I would
expect that on modern x86 (and x86_64) machines that the penalty
for misalignment would be only one or two cycles. Whether that
is important enough to you (generic "you") to affect your buying
decision about which compiler to use is something you'd have
to answer for yourself. But compiler writers *do* worry about
such things.

+---------------
| As for the VM mapping bit, I was just saying that virtual
| address 48B68C28 (labelled FAST in your example) might be
| mapped to physical address, say, xxxx20, which is what the CPU
| actually fetches from, and thus your 16-byte alignment that
| holds in VM land might just break after translation.
+---------------

Oh, got it. No, actually, that *CAN'T* happen!!!  ;-}  ;-}
The page size on x86 machines is typically 4 KiB, so the
low three hex digits of any virtual address *will* remain
unchanged across the vir2phys translation. That is, 48B68C28
will always go to xxxxxC28. And since cache lines are aligned
within pages...

But to answer the unstated related question: yes, there have
been cases historically where the compiler had to pay attention
to page crossings as well as cache line crossings. [E.g., the
infamous early MIPS R4000 bug that showed up if a jump was the
last instruction on a page, "fixed" with a change to the linker!!
The SGI compiler included a "-no_jump_at_eop" option in "ld"
for a while.]


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Duane Rettig
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <o0od33kig6.fsf@gemini.franz.com>
····@rpw3.org (Rob Warnock) writes:

> verec  <·····@mac.com> wrote:
> +---------------
> | ····@rpw3.org (Rob Warnock) wrote:
> | [a very detailed explanation of his Lisp-on-Athlon cache alignment
> | experiments]
> | 
> | That's certainly more than I expected, but it raises the
> | question of how realistic such improved "cache-aligned" loops
> | might be in the real world:
> | 
> | Given the byzantine per-opcode byte counts of the x86 ISA, it
> | is very likely that most loops will need more than 16 bytes of
> | code between the branch-back and the top of the loop, thus
> | incurring a systematic 16-byte fetch-cycle miss each and every
> | time through the loop.
> +---------------
>
> In my conversations with people who *are* experienced compiler
> writers [I'm not, just an amateur there], I've been told that
> for algorithmic kernels keeping loops 16-byte-aligned is an
> easy way to get a small but measurable improvement. I would
> expect that on modern x86 (and x86_64) machines that the penalty
> for misalignment would be only one or two cycles. Whether that
> is important enough to you (generic "you") to affect your buying
> decision about which compiler to use is something you'd have
> to answer for yourself. But compiler writers *do* worry about
> such things.

It's not only about the time to load a cache line, but how many
you'll be loading: if a loop would fit within a cache line but
starts in the middle of one, then two cache lines are consumed,
and that leaves one less physical line available in the cache when
the LRU replacement takes over (presumably these two lines are
recently used, and so will not be flushed themselves).  So a
misaligned loop makes the whole machine slower, in a sense, by
giving it a smaller effective cache.

> But to answer the unstated related question: yes, there have
> been cases historically where the compiler had to pay attention
> to page crossings as well as cache line crossings. [E.g., the
> infamous early MIPS R4000 bug that showed up if a jump was the
> last instruction on a page, "fixed" with a change to the linker!!
> The SGI compiler included a "-no_jump_at_eop" option in "ld"
> for a while.]

We refused to buckle to this kind of hardware bug; it took us a year
to get SGI to even admit that they had one (they had claimed to
have fixed the bug in software, but that fix counted on the code
being at a fixed location on the page).  And because we had some large
customers behind us, we finally got them to settle the issue by
noting that the number of people with that particular lot of R4000s
who were also using Lisp was in the single digits, because most of
them had already moved on; so we got them to agree to replace the
chips of anyone who requested such a replacement in a certain way.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: John Thingstad
Subject: Re: How much tuning does regular lisp compilers do?
Date: 
Message-ID: <op.ugpf92m4ut4oq5@pandora.alfanett.no>
On Sat, 30 Aug 2008 04:56:58 +0200, Rob Warnock <····@rpw3.org> wrote:

> Andreas Davour  <·······@updateLIKE.uu.HELLse> wrote:
> +---------------
> | I've just spent three days in an application tuning seminar, learning
> | everything about cache lines, cache hits, inlining functions, unrolling
> | loops, openMP and more of that kind of techniques. ...
> +---------------
>
> The only thing I can contribute to that is:
>
> 1. Compiled CLs are not always careful about placing functions
>    on optimal cache line boundaries.
>
> 2. Item #1 can make a *significant* difference in runtime.
>

This is a bit more difficult than it sounds: it varies with the
application and the processor.
Intel's optimizing compilers (Fortran and C++) let you run run-time
monitoring and then recompile to optimize cache hits.
Writing a system for Lisp that could do this in real time would be an
interesting challenge.
I imagine a segmented heap could allow you to move related data
together to fit into the cache for related operations, not just code.
(A PhD project, perhaps?)

--------------
John Thingstad