From: Alexander Kjeldaas
Subject: Re: Timing for various operations on a PC
Date: 
Message-ID: <dkb654$j57$1@localhost.localdomain>
········@gmail.com wrote:
>     In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
> he says it is a good idea for anybody interested in CS to find out
> certain details like -- " Know how long it takes your computer to
> execute an instruction, fetch a word from memory (with and without a
> cache miss), read consecutive words from disk, and seek to a new
> location on disk." He gives a table(values from 2001) for his PC.
>     Can anyone tell me how to determine this accurately for a PC?
> Especially things like time taken to fetch something from L1 cache? Or
> is it something we have to look for the in the PC/processor manuals?
>       This group may not be the apropriate place for the question, but
> I have always found good answers here and I feel the topic maybe
> relevant to a programmer of any language.
> 

I will give a slightly different answer than most of the others.  After 
you have acquainted yourself with some literature about the micro 
architecture of modern processors (read the Intel manuals for example), 
you will get a decent picture about timings of different instructions.

However, in actual code, non-predicted branches and cache-misses are 
often what consumes most of the time, so you should use a statistical 
profiler in order to learn the relative speed of the different 
instructions for a given function.  A statistical profiler will use 
built-in hardware or software interrupts to sample where the instruction 
  pointer is at a semi-constant rate.  By running a piece of code lots 
of times, the samples will give a picture of where time is mostly spent.

For SBCL I would recommend the sb-sprof package.  It is beautifully 
integrated with the SBCL disassembler and will give you the number of 
samples that were collected for each instruction.

Example:

decibel:~:(9)$ sbcl
This is SBCL 0.9.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (+ a (* b c)))

FOO
* (require :sb-sprof)

("SB-SPROF")
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))

* (disassemble 'foo)

; 03A9628B:       488BC7           MOV RAX, RDI               ; 
no-arg-parsing entry point
;       8E:       48C1F803         SAR RAX, 3
;       92:       480FAFC6         IMUL RAX, RSI              ; 24/369 
samples
;       96:       488D1403         LEA RDX, [RBX+RAX]         ; 17/369 
samples
;       9A:       488B4DF0         MOV RCX, [RBP-16]
;       9E:       488B45F8         MOV RAX, [RBP-8]
;       A2:       4883C103         ADD RCX, 3                 ; 18/369 
samples
;       A6:       488BE5           MOV RSP, RBP
;       A9:       488BE8           MOV RBP, RAX
;       AC:       FFE1             JMP RCX                    ; 9/369 
samples
;       AE:       90               NOP
;       AF:       90               NOP
;       B0:       CC0A             BREAK 10                   ; error trap
;       B2:       02               BYTE #X02
;       B3:       18               BYTE #X18                  ; 
INVALID-ARG-COUNT-ERROR
;       B4:       CE               BYTE #XCE                  ; RBX
;       B5:       CC0A             BREAK 10                   ; error trap
;       B7:       02               BYTE #X02
;       B8:       18               BYTE #X18                  ; 
INVALID-ARG-COUNT-ERROR
;       B9:       4E               BYTE #X4E                  ; RCX
;
NIL

Here you see that for a lot of the instructions, the sampler never saw 
the instruction pointer at their address.  This does not mean that they 
take zero time, but they are probably executed in the same clock cycle 
as the next instruction.  The CPU will retire several instructions every 
clock cycle.  Be careful about trusting the samples for the JMP - the 
time is often accounted for at the target location (which is not shown 
here).

Now if we change FOO to do some calculations and put the answer in a 
CONS-cell:


* (defun foo (a b c) (declare ((integer 0 50) a b c)) (cons nil (+ a (* 
b c))))
STYLE-WARNING: redefining FOO in DEFUN

FOO
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))

* (disassemble 'foo)

; 02BB1B3B:       488BC7           MOV RAX, RDI               ; 
no-arg-parsing entry point
;       3E:       48C1F803         SAR RAX, 3
;       42:       480FAFC6         IMUL RAX, RSI              ; 17/1949 
samples
;       46:       488D1403         LEA RDX, [RBX+RAX]         ; 15/1949 
samples
;       4A:       41C684249800000000 MOV BYTE PTR [R12+152], 0
                                           (CLEAR INTERRUPT FLAG?)
;       53:       41C684249000000008 MOV BYTE PTR [R12+144], 8
                                           (INDICATE UNINTERRUPTIBLE)
;       5C:       48C7C110000000   MOV RCX, 16
;       63:       49034C2440       ADD RCX, [R12+64]
;       68:       49394C2448       CMP [R12+72], RCX
;       6D:       765B             JBE L2
                                           (CALCULATE NEW END-OF-HEAP
                                            AND SEE IF WE NEED TO
                                            ALLOCATE MOVE MEMORY)
;       6F:       49874C2440       XCHG RCX, [R12+64]
                                           (GET ALLOCATED MEMORY/
                                            UPDATE END-OF-HEAP)
;       74: L0:   488D4907         LEA RCX, [RCX+7]
;       78:       41C684249000000000 MOV BYTE PTR [R12+144], 0
;       81:       4180BC249800000000 CMP BYTE PTR [R12+152], 0
;       8A:       7402             JEQ L1
;       8C:       CC09             BREAK 9                    ; pending 
interrupt trap
                                            (INDICATE INTERRUPTIBLE/
                                             HANDLE INTERRUPTS)
;       8E: L1:   48B81700004000000000 MOV RAX, 1073741847    ; 509/1949 
samples
;       98:       488941F9         MOV [RCX-7], RAX           ; 510/1949 
samples
;       9C:       48895101         MOV [RCX+1], RDX           ; 105/1949 
samples
;       A0:       488BD1           MOV RDX, RCX               ; 8/1949 
samples
;       A3:       488B4DF0         MOV RCX, [RBP-16]          ; 13/1949 
samples
;       A7:       488B45F8         MOV RAX, [RBP-8]
;       AB:       4883C103         ADD RCX, 3
;       AF:       488BE5           MOV RSP, RBP               ; 23/1949 
samples
;       B2:       488BE8           MOV RBP, RAX
;       B5:       FFE1             JMP RCX
;       B7:       90               NOP
;       B8:       90               NOP
;       B9:       90               NOP
;       BA:       90               NOP
;       BB:       90               NOP
;       BC:       90               NOP
;       BD:       90               NOP
;       BE:       90               NOP
;       BF:       90               NOP
;       C0:       CC0A             BREAK 10                   ; error trap
;       C2:       02               BYTE #X02
;       C3:       18               BYTE #X18                  ; 
INVALID-ARG-COUNT-ERROR
;       C4:       CE               BYTE #XCE                  ; RBX
;       C5:       CC0A             BREAK 10                   ; error trap
;       C7:       02               BYTE #X02
;       C8:       18               BYTE #X18                  ; 
INVALID-ARG-COUNT-ERROR
;       C9:       4E               BYTE #X4E                  ; RCX
;       CA: L2:   6A10             PUSH 16
;       CC:       4C8D2C2560AA4100 LEA R13, [#x41AA60]        ; alloc_tramp
;       D4:       41FFD5           CALL R13
;       D7:       59               POP RCX
;       D8:       EB9A             JMP L0
;
NIL

Here we see that there is a whole set of instructions that have no 
samples.  These are not sampled because interrupts are turned off (I 
think?) and are accounted for by the large sample count for the MOV at 
L1.  The sample for that MOV instruction should therefore be taken as 
the total over the whole allocation region.

However, the large sample counts for the MOVs at address 98 and 9C are 
very real.  They represent cache-misses for newly allocated CONS cells! 
  Usually the first MOV will take the hit, but some times a CONS cell 
might cross a cache-line and both the first and the second MOV will 
stall in this case.

Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and 
a CONS cell is 16 bytes, we would expect a CONS-cell to cross a 
cache-line boundary one out of 4 times.  The 5:1 ratio between the 98/9C 
MOVs supports the theory that most time is spent waiting for memory.

So what do you lean from this?

You learn that a multiplication instruction is about 60 times cheaper 
than allocating and populating a CONS cell in SBCL (17 vs 509+510+105 
samples and that is not counting GC time spent reclaiming the garbage).

You learn that the amortized time to to lots of MOVs, CMP etc to 
allocate a CONS cell is less than the time spent waiting to access the 
memory that was allocated (509 vs 510+105 cycles).  This is because all 
the memory accessed during allocation is cached, while the allocated 
memory is usually not cached.

You learn that the multiplication is slightly slower than addition, but 
that it probably does not matter and that these operations will be 
swamped by other stuff unless you concentrate on the inner loop (24 vs 
17 samples).

This is all handy to know.


astor

From: Juho Snellman
Subject: Re: Timing for various operations on a PC
Date: 
Message-ID: <slrndmibvu.rnl.jsnell@dogbert.localdomain>
<··········@fast.no> wrote:
> * (progn
> (sb-sprof:start-profiling)
> (dotimes (i 300000000) (foo 2 3 4))
> (sb-sprof:stop-profiling))

SB-SPROF:WITH-PROFILING might be more appropriate than explicit 
START/STOP-PROFILING calls in this case.

> ;       6F:       49874C2440       XCHG RCX, [R12+64]
>                                            (GET ALLOCATED MEMORY/
>                                             UPDATE END-OF-HEAP)

"Oops". That XCHG is a remnant of the early days of the x86-64 port,
when no temporary register was usable here. A register/memory XCHG has
a very high latency, so consing could be sped up significantly by
rearranging the code a bit. Based on a quick hack, it looks like your
example program would get about 30% faster.

The moral of the story: even if a statistical profiler is available, 
it sometimes pays to read the instruction timings from the
processor's documentation :-)

> Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and 
> a CONS cell is 16 bytes, we would expect a CONS-cell to cross a 
> cache-line boundary one out of 4 times.  The 5:1 ratio between the 98/9C 
> MOVs supports the theory that most time is spent waiting for memory.

Good theory, but consed memory is always 16-byte aligned on 64-bit 
SBCL.

-- 
Juho Snellman
From: ········@gmail.com
Subject: Re: Timing for various operations on a PC
Date: 
Message-ID: <1130974598.691984.98060@g14g2000cwa.googlegroups.com>
Thanks everyone for the great replies and discussions. This is cool :-)

-- Aravindh