········@gmail.com wrote:
> In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
> he says it is a good idea for anybody interested in CS to find out
> certain details like -- " Know how long it takes your computer to
> execute an instruction, fetch a word from memory (with and without a
> cache miss), read consecutive words from disk, and seek to a new
> location on disk." He gives a table (values from 2001) for his PC.
> Can anyone tell me how to determine this accurately for a PC?
> Especially things like time taken to fetch something from L1 cache? Or
> is it something we have to look for in the PC/processor manuals?
> This group may not be the appropriate place for the question, but
> I have always found good answers here and I feel the topic may be
> relevant to a programmer of any language.
>
I will give a slightly different answer than most of the others. After
you have acquainted yourself with some literature about the
microarchitecture of modern processors (read the Intel manuals, for
example), you will get a decent picture of the timings of different
instructions.
However, in actual code, mispredicted branches and cache misses are
often what consume most of the time, so you should use a statistical
profiler to learn the relative cost of the different instructions in a
given function. A statistical profiler uses built-in hardware counters
or software interrupts to sample where the instruction pointer is at a
roughly constant rate. By running a piece of code many times, the
samples give a picture of where the time is mostly spent.
For SBCL I would recommend the sb-sprof package. It is beautifully
integrated with the SBCL disassembler and will give you the number of
samples that were collected for each instruction.
Example:
decibel:~:(9)$ sbcl
This is SBCL 0.9.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (+ a (* b c)))
FOO
* (require :sb-sprof)
("SB-SPROF")
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 03A9628B: 488BC7    MOV RAX, RDI        ; no-arg-parsing entry point
;       8E: 48C1F803  SAR RAX, 3
;       92: 480FAFC6  IMUL RAX, RSI       ; 24/369 samples
;       96: 488D1403  LEA RDX, [RBX+RAX]  ; 17/369 samples
;       9A: 488B4DF0  MOV RCX, [RBP-16]
;       9E: 488B45F8  MOV RAX, [RBP-8]
;       A2: 4883C103  ADD RCX, 3          ; 18/369 samples
;       A6: 488BE5    MOV RSP, RBP
;       A9: 488BE8    MOV RBP, RAX
;       AC: FFE1      JMP RCX             ; 9/369 samples
;       AE: 90        NOP
;       AF: 90        NOP
;       B0: CC0A      BREAK 10            ; error trap
;       B2: 02        BYTE #X02
;       B3: 18        BYTE #X18           ; INVALID-ARG-COUNT-ERROR
;       B4: CE        BYTE #XCE           ; RBX
;       B5: CC0A      BREAK 10            ; error trap
;       B7: 02        BYTE #X02
;       B8: 18        BYTE #X18           ; INVALID-ARG-COUNT-ERROR
;       B9: 4E        BYTE #X4E           ; RCX
;
NIL
Here you see that for a lot of the instructions, the sampler never saw
the instruction pointer at their address. This does not mean that they
take zero time; rather, they are probably executed in the same clock
cycle as a neighbouring instruction, since the CPU can retire several
instructions every clock cycle. Be careful about trusting the samples
for the JMP - the time is often accounted for at the target location
(which is not shown here).
Now if we change FOO to do some calculations and put the answer in a
CONS-cell:
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (cons nil (+ a (* b c))))
STYLE-WARNING: redefining FOO in DEFUN
FOO
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 02BB1B3B: 488BC7                MOV RAX, RDI               ; no-arg-parsing entry point
;       3E: 48C1F803              SAR RAX, 3
;       42: 480FAFC6              IMUL RAX, RSI              ; 17/1949 samples
;       46: 488D1403              LEA RDX, [RBX+RAX]         ; 15/1949 samples
;       4A: 41C684249800000000    MOV BYTE PTR [R12+152], 0  (CLEAR INTERRUPT FLAG?)
;       53: 41C684249000000008    MOV BYTE PTR [R12+144], 8  (INDICATE UNINTERRUPTIBLE)
;       5C: 48C7C110000000        MOV RCX, 16
;       63: 49034C2440            ADD RCX, [R12+64]
;       68: 49394C2448            CMP [R12+72], RCX
;       6D: 765B                  JBE L2                     (CALCULATE NEW END-OF-HEAP AND SEE
;                                                             IF WE NEED TO ALLOCATE MORE MEMORY)
;       6F: 49874C2440            XCHG RCX, [R12+64]         (GET ALLOCATED MEMORY /
;                                                             UPDATE END-OF-HEAP)
;       74: L0: 488D4907          LEA RCX, [RCX+7]
;       78: 41C684249000000000    MOV BYTE PTR [R12+144], 0
;       81: 4180BC249800000000    CMP BYTE PTR [R12+152], 0
;       8A: 7402                  JEQ L1
;       8C: CC09                  BREAK 9                    ; pending interrupt trap
;                                                            (INDICATE INTERRUPTIBLE /
;                                                             HANDLE INTERRUPTS)
;       8E: L1: 48B81700004000000000 MOV RAX, 1073741847     ; 509/1949 samples
;       98: 488941F9              MOV [RCX-7], RAX           ; 510/1949 samples
;       9C: 48895101              MOV [RCX+1], RDX           ; 105/1949 samples
;       A0: 488BD1                MOV RDX, RCX               ; 8/1949 samples
;       A3: 488B4DF0              MOV RCX, [RBP-16]          ; 13/1949 samples
;       A7: 488B45F8              MOV RAX, [RBP-8]
;       AB: 4883C103              ADD RCX, 3
;       AF: 488BE5                MOV RSP, RBP               ; 23/1949 samples
;       B2: 488BE8                MOV RBP, RAX
;       B5: FFE1                  JMP RCX
;       B7: 90                    NOP
;       B8: 90                    NOP
;       B9: 90                    NOP
;       BA: 90                    NOP
;       BB: 90                    NOP
;       BC: 90                    NOP
;       BD: 90                    NOP
;       BE: 90                    NOP
;       BF: 90                    NOP
;       C0: CC0A                  BREAK 10                   ; error trap
;       C2: 02                    BYTE #X02
;       C3: 18                    BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       C4: CE                    BYTE #XCE                  ; RBX
;       C5: CC0A                  BREAK 10                   ; error trap
;       C7: 02                    BYTE #X02
;       C8: 18                    BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       C9: 4E                    BYTE #X4E                  ; RCX
;       CA: L2: 6A10              PUSH 16
;       CC: 4C8D2C2560AA4100      LEA R13, [#x41AA60]        ; alloc_tramp
;       D4: 41FFD5                CALL R13
;       D7: 59                    POP RCX
;       D8: EB9A                  JMP L0
;
NIL
Here we see that there is a whole set of instructions that have no
samples. These are not sampled because interrupts are turned off (I
think?), and their time is accounted for by the large sample count for
the MOV at L1. The samples for that MOV instruction should therefore
be taken as the total over the whole allocation region.
However, the large sample counts for the MOVs at addresses 98 and 9C
are very real. They represent cache misses for newly allocated CONS
cells! Usually the first MOV will take the hit, but sometimes a CONS
cell might cross a cache line, in which case both the first and the
second MOV will stall.
Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and
a CONS cell is 16 bytes, we would expect a CONS-cell to cross a
cache-line boundary one out of 4 times. The 5:1 ratio between the 98/9C
MOVs supports the theory that most time is spent waiting for memory.
So what do you learn from this?
You learn that a multiplication instruction is about 60 times cheaper
than allocating and populating a CONS cell in SBCL (17 vs 509+510+105
samples, and that is not counting GC time spent reclaiming the
garbage).
You learn that the amortized time to do lots of MOVs, CMPs, etc. to
allocate a CONS cell is less than the time spent waiting to access the
memory that was allocated (509 vs 510+105 samples). This is because all
the memory accessed during allocation is cached, while the allocated
memory is usually not.
You learn that the multiplication is slightly slower than addition, but
that it probably does not matter and that these operations will be
swamped by other stuff unless you concentrate on the inner loop (24 vs
17 samples).
This is all handy to know.
astor
<··········@fast.no> wrote:
> * (progn
> (sb-sprof:start-profiling)
> (dotimes (i 300000000) (foo 2 3 4))
> (sb-sprof:stop-profiling))
SB-SPROF:WITH-PROFILING might be more appropriate than explicit
START/STOP-PROFILING calls in this case.
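For example, something like this (a sketch; see the sb-sprof
documentation for the full set of keyword options):

```lisp
(require :sb-sprof)

;; WITH-PROFILING stops the profiler even if the body unwinds, and
;; with :REPORT it prints a report when the body finishes, so there
;; is no way to forget a STOP-PROFILING call.
(sb-sprof:with-profiling (:report :flat)
  (dotimes (i 300000000)
    (foo 2 3 4)))
```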
> ;       6F: 49874C2440            XCHG RCX, [R12+64]         (GET ALLOCATED MEMORY /
> ;                                                             UPDATE END-OF-HEAP)
"Oops". That XCHG is a remnant of the early days of the x86-64 port,
when no temporary register was usable here. A register/memory XCHG has
a very high latency, so consing could be sped up significantly by
rearranging the code a bit. Based on a quick hack, it looks like your
example program would get about 30% faster.
The moral of the story: even if a statistical profiler is available,
it sometimes pays to read the instruction timings from the
processor's documentation :-)
> Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and
> a CONS cell is 16 bytes, we would expect a CONS-cell to cross a
> cache-line boundary one out of 4 times. The 5:1 ratio between the 98/9C
> MOVs supports the theory that most time is spent waiting for memory.
Good theory, but consed memory is always 16-byte aligned on 64-bit
SBCL.
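The effect of the alignment is easy to check at the REPL. This little
helper (hypothetical, not part of SBCL) counts how many of a run of
consecutively allocated 16-byte cells straddle a 64-byte cache line:

```lisp
(defun crossing-fraction (start &optional (n 1000) (size 16) (line 64))
  "Fraction of N consecutive SIZE-byte cells, allocated back to back
starting at address START, that straddle a LINE-byte cache line."
  (/ (loop for addr from start by size
           repeat n
           ;; A cell crosses a line when its first and last byte
           ;; fall into different cache lines.
           count (/= (floor addr line)
                     (floor (+ addr size -1) line)))
     n))

(crossing-fraction 8)  ; 8-byte-aligned start => 1/4 of the cells cross
(crossing-fraction 16) ; 16-byte-aligned start => 0, no cell ever crosses
```

With an 8-byte-aligned (but not 16-byte-aligned) allocation pointer the
one-in-four figure would indeed hold; with 16-byte alignment the
crossing simply cannot happen.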
--
Juho Snellman