I have noticed that even though Lispworks compiled code is very fast in tight
loops, it seems very slow when it involves memory allocation. It seems to
take about 500 CPU cycles per cons cell, including whatever time that adds to
garbage collection. In C++ I can allocate and free a memory cell by taking
it off a "free list" and putting it back on, which adds up to just a few CPU
instructions. I'm confused about why Lispworks takes so long. What is it
doing? Is it the garbage collection that takes most of the time?
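[The free-list scheme the poster describes might be sketched in C like this — a minimal illustration, not anything LispWorks or any particular C++ runtime actually does; the cell layout and names are invented:]

```c
#include <stdlib.h>

/* Minimal free-list cell allocator, as described in the post above.
 * Pop on allocate, push on free: a few instructions each way. */
typedef struct cell { struct cell *next; void *car, *cdr; } cell;

static cell *free_list = NULL;

static cell *cell_alloc(void) {
    if (free_list) {                 /* fast path: pop the head */
        cell *c = free_list;
        free_list = c->next;
        return c;
    }
    return malloc(sizeof(cell));     /* slow path: refill from malloc */
}

static void cell_free(cell *c) {     /* push back onto the list */
    c->next = free_list;
    free_list = c;
}
```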
From: Stephen J. Bevan
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <m366eeg55t.fsf@yahoo.com>
··········@questions.com writes:
> I have noticed that even though Lispworks compiled code is very fast in tight
> loops, it seems very slow when it involves memory allocation. It seems to
> take about 500 CPU cycles per cons cell, including whatever time that adds to
> garbage collection. In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions. I'm confused about why Lispworks takes so long. What is it
> doing? Is it the garbage collection that takes most of the time?
I don't know anything about LispWorks memory management but I can say
something about its stablemate MLWorks. On a SPARC it would take 3
instructions to allocate space for a cons. IIRC it was somewhere
between 9 and 15 on an ix86 (I've forgotten the exact sequences).
Consequently I would expect LispWorks to have similar numbers.
How did you arrive at the 500 cycles measurement?
On Sat, 02 Jun 2001 17:31:11 GMT, ·············@yahoo.com (Stephen J. Bevan)
wrote:
>How did you arrive at the 500 cycles measurement?
From a tight loop that repeatedly makes a short list of small integers and
assigns it to the same variable, thereby abandoning the previous short list.
I no longer have the exact code, because the free version of Lispworks exits
after a few hours. If you have a better test, and don't have Lispworks
installed, you can post it and I will run it and post the results. My
calculations took into account how many cons cells my short list had, how
many iterations, and the speed of my CPU. A tight Lispworks loop without any
consing can take as little as 4 to 5 machine cycles per iteration, which
implies that most of the time consumed is related to the consing.
··········@questions.com writes:
> On Sat, 02 Jun 2001 17:31:11 GMT, ·············@yahoo.com (Stephen J. Bevan)
> wrote:
>
> >How did you arrive at the 500 cycles measurement?
>
> From a tight loop that repeatedly makes a short list of small integers and
> assigns it to the same variable, thereby abandoning the previous short list.
> I no longer have the exact code, because the free version of Lispworks exits
> after a few hours. If you have a better test, and don't have Lispworks
> installed, you can post it and I will run it and post the results.
Just wondering if, for example, you disabled multiprocessing so that you
weren't timing process switches and other background processes.
You should always do mp:without-preemption around such things. It's
possible that on a tight loop that does nothing, you go fast enough that
you get no process switches, for example, but that if you are consing
either the fact of some function call or the fact of a little longer run
causes a process switch.
> My
> calculations took into account how many cons cells my short list had, how
> many iterations, and the speed of my CPU. A tight Lispworks loop without any
> consing can take as little as 4 to 5 machine cycles per iteration, which
> implies that most of the time consumed is related to the consing.
This wasn't really the part I was caring about, though I can see why you'd
think I was. It is important to know this as well, though, in order to have
any useful discussion. One should always talk about the speed of specific
code, not the speed of code in general, I think. Different things are often
different speeds. That's why there is no one single benchmark that works
for all purposes.
On Sat, 2 Jun 2001 19:05:40 GMT, Kent M Pitman <······@world.std.com> wrote:
>You should always do mp:without-preemption around such things. It's
Ok, I did that, and it made it 8.8% faster. I will always do that from now
on. But in any case, my main question still stands. What is Lispworks doing
that takes so much time? I'm not trying to fine-tune the speed, but just to
understand what it could be doing during all that time. Multiprocessing was
a factor I should have thought of, but it turns out not to be a big factor,
so there must be some other big factor I don't know about.
··········@questions.com writes:
> On Sat, 2 Jun 2001 19:05:40 GMT, Kent M Pitman <······@world.std.com> wrote:
>
> >You should always do mp:without-preemption around such things. It's
>
> Ok, I did that, and it made it 8.8% faster. I will always do that from now
> on.
I don't think you should *always* do that.
> But in any case, my main question still stands. What is Lispworks doing
> that takes so much time?
Have you looked at disassembly?
--
Janis Dzerins
If million people say a stupid thing it's still a stupid thing.
From: Stephen J. Bevan
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <m3zobqe66q.fsf@yahoo.com>
··········@questions.com writes:
> On Sat, 02 Jun 2001 17:31:11 GMT, ·············@yahoo.com (Stephen J. Bevan)
> >How did you arrive at the 500 cycles measurement?
>
> From a tight loop that repeatedly makes a short list of small integers and
> assigns it to the same variable, thereby abandoning the previous short list.
> I no longer have the exact code, because the free version of Lispworks exits
> after a few hours. If you have a better test, and don't have
> Lispworks ...
My better test, such as it is, is to try and coax LispWorks to
display the assembly for a function that uses cons so that you can see
exactly what instructions are used -- that's how you would find out
the sequences using MLWorks. I've no idea if LispWorks in general, or
the particular version you are using, will let you do this though.
··········@questions.com writes:
> I have noticed that even though Lispworks compiled code is very fast in tight
> loops, it seems very slow when it involves memory allocation. It seems to
> take about 500 CPU cycles per cons cell, including whatever time that adds to
> garbage collection. In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions. I'm confused about why Lispworks takes so long. What is it
> doing? Is it the garbage collection that takes most of the time?
I have no idea if you're right or not, but the first order of business
would be for you to say precisely how you arrived at the 500 CPU cycle
number.
On Sat, 2 Jun 2001 17:58:32 GMT, Kent M Pitman <······@world.std.com> wrote:
>I have no idea if you're right or not, but the first order of business
>would be for you to say precisely how you arrived at the 500 CPU cycle
>number.
The 500 was from using list to allocate the cons cells. When I use the
following code, it's reduced by a factor of 2. But it's still an order of
magnitude more time-consuming than it seems like it should be.
Here are two versions of the loop, with and without the cons:
(defvar *X*)
(defun test1 (n) (dotimes (j n) (dotimes (i 1000) (setq *X* 0))))
(defun test2 (n) (dotimes (j n) (dotimes (i 1000) (setq *X* (cons 1 2)))))
(compile 'test1)
(compile 'test2)
Why not run them yourself and see if you get the same results? The one
without the cons runs in a small fraction of the time, making it clear that
the time-consumption is related to the cons.
From: Christopher J. Vogt
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <3B1961DE.7C518D57@computer.org>
··········@questions.com wrote:
>
> On Sat, 2 Jun 2001 17:58:32 GMT, Kent M Pitman <······@world.std.com> wrote:
>
> >I have no idea if you're right or not, but the first order of business
> >would be for you to say precisely how you arrived at the 500 CPU cycle
> >number.
>
> The 500 was from using list to allocate the cons cells. When I use the
> following code, it's reduced by a factor of 2. But it's still an order of
> magnitude more time-consuming than it seems like it should be.
>
> Here are two versions of the loop, with and without the cons:
>
> (defvar *X*)
> (defun test1 (n) (dotimes (j n) (dotimes (i 1000) (setq *X* 0))))
> (defun test2 (n) (dotimes (j n) (dotimes (i 1000) (setq *X* (cons 1 2)))))
> (compile 'test1)
> (compile 'test2)
>
> Why not run them yourself and see if you get the same results? The one
> without the cons runs in a small fraction of the time, making it clear that
> the time-consumption is related to the cons.
To determine much of anything, you'd really need to run the same sort of code
on the same machine written in C or some other language for comparison. Maybe
your machine is paging, maybe it is some cache issue. It's easy to say what
it *should* be in C, but it is equally easy to say what it should be in Lisp,
and just as easy to be wrong. Performance measuring and understanding is
generally much more difficult than most people think.
From: Bulent Murtezaoglu
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <87wv6uh2vl.fsf@nkapi.internal>
>>>>> "CJV" == Christopher J Vogt <····@computer.org> writes:
[...]
CJV> To determine much of anything, you'd really need to run the
CJV> same sort of code on the same machine written in C or some
CJV> other language for comparison. Maybe your machine is paging,
CJV> maybe it is some cache issue. It's easy to say what it
CJV> *should* be in C, but it is equally easy to say what it
CJV> should be in Lisp, and just as easy to be wrong. [...]
All good advice. Might I add that we *DO* have a disassemble facility
in CL? It is not too hard to notice that there's a type check or
register allocation issue by glancing at the disassembly
output. Compilers usually warn about doing type checks, but some
remain silent if there wasn't an optimizer to try.
cheers,
BM
··········@questions.com writes:
In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions.
Lisp can have similar allocation and deallocation semantics when using
resourced storage (pools).
··········@questions.com writes:
> In C++ I can allocate and free a memory cell by taking it off a
> "free list" and putting it back on, which adds up to just a few CPU
> instructions.
If you insist that a general malloc/free or new/delete implementation
will always work with just a few CPU instructions, you need to study
Paul Wilson's review of memory allocators
(ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps).
But yes, your test generates a lot of garbage, and gc will take some
time. Allocation in modern gc implementations is generally faster than
with malloc implementations. In the extreme case, it is an increment of
a pointer plus a check whether the pointer has gone out of the range of
the current generation.
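[The pointer-bump fast path described here can be sketched in C — region size and refill policy below are invented for illustration; a real collector would run a GC in the slow path instead of calling malloc:]

```c
#include <stddef.h>
#include <stdlib.h>

/* Bump allocation with an end-of-region check: the GC fast path
 * described above is one compare and one add. */
static char *free_ptr = NULL, *region_end = NULL;

static void *gc_alloc_slow(size_t n) {
    /* In a real system this is where a collection would happen;
     * here we just grab a fresh region from malloc. */
    size_t region = 1u << 20;           /* 1 MB, arbitrary */
    free_ptr = malloc(region);
    region_end = free_ptr + region;
    void *p = free_ptr;
    free_ptr += n;
    return p;
}

static void *gc_alloc(size_t n) {
    if (n > (size_t)(region_end - free_ptr))  /* one compare */
        return gc_alloc_slow(n);              /* rare slow path */
    void *p = free_ptr;
    free_ptr += n;                            /* one add */
    return p;
}
```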
--
Lieven Marchand <···@wyrd.be>
Making laws appears to win votes. Enforcing them doesn't.
See Rule One. Roger Burton West in the monastery.
··········@questions.com writes:
> I have noticed that even though Lispworks compiled code is very fast in tight
> loops, it seems very slow when it involves memory allocation. It seems to
> take about 500 CPU cycles per cons cell, including whatever time that adds to
> garbage collection. In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions. I'm confused about why Lispworks takes so long. What is it
> doing? Is it the garbage collection that takes most of the time?
You are probably comparing apples and oranges there.
Let's take a look at the following function:
(defun time-cons (x)
(declare (optimize (speed 3) (safety 0) (space 0) (debug 0))
(fixnum x))
(let ((res nil))
(dotimes (i x)
(declare (fixnum i))
(dotimes (j 1000)
(declare (fixnum j))
(setq res (cons i j))))
res))
When we time this function on 10000 outer loops (thereby consing 10M
conses, or around 80MB of space), LispWorks takes around 2.9 seconds
on an AMD K6-2/550. That comes out at around 290ns or less than 160
cycles per iteration.
Now note that the code is doing more than just allocating the conses,
it is also initializing both the car and cdr of those to some fixnum.
To determine the amount of time this takes, we compare the runtime
with the following function:
(defun time-set-cons (x)
(declare (optimize (speed 3) (safety 0) (space 0) (debug 0))
(fixnum x))
(let ((res (cons 0 0)))
(dotimes (i x)
(declare (fixnum i))
(dotimes (j 1000)
(declare (fixnum j))
(rplaca res i)
(rplacd res j)))
res))
This takes 1.1 seconds for 10000 outer loops, or around 110ns of the
290ns we determined earlier. In reality this influence is likely to
be _greater_, though, since the memory changed in time-set-cons always
is in the 1st level cache, whereas we modify a constantly changing set
of conses in time-cons, and it is likely that the infirmary generation is
larger than the 1st level cache.
Experimentation using arrays of conses to simulate non-locality
effects on the cache take from 1.3s (1000 conses, i.e. a little over
12KB working set, fully inside the 32KB 1st level data-cache), through
1.5s (10000 conses, i.e. 80KB WS inside 2nd level data-cache), to 3.0s
(100000 conses, i.e. 800KB WS outside 2nd level data-cache).
So we will likely have to explain less than 180ns, or under 100
cycles, for the allocation and reclamation of each cons.
If we now take a look at the GC information that LW's extended-time
macro will give us, we can see that we spend nearly 1.3s of 3.1s
runtime in around 420 mark-and-sweep collections of the infirmary
generation. That seems a tad too long, and I'm certain that tuning
LWW's GC parameters would reduce that amount of time.
What remains is between 10ns and 50ns or 6-27 cycles per cons
allocated, which seems reasonable, given that allocation takes little
time in a well designed system.
On CMU CL, which hasn't the best of generational GCs, we get the
following numbers:
Real User Sys GC
TIME-SET-CONS: 50 50 0 0
TIME-CONS, 128KB infirmary: 8640 7810 820 5290
TIME-CONS, default 2MB infirmary: 4160 3430 730 940
TIME-CONS, 20MB infirmary: 3440 2460 970 70
The system time is mostly because of ZERO page initialization of
memory, which CMU CL does by unmapping and remapping of pages on
Linux. This hurts performance a bit in this instance, because we
initialize the cons anyway.
As we can see, large infirmary generations can help in the case of
massive ephemeral garbage creation, because of savings in the overhead
of collections, when collections are costly.
To show how much code is needed to implement the non-GC part of
TIME-CONS, here's the disassembly of it on CMU CL (which provided the
easiest to follow disassembly):
480106F0: .ENTRY TIME-CONS() ; FUNCTION
708: POP DWORD PTR [EBP-8]
70B: LEA ESP, [EBP-32]
70E: MOV EAX, 671088651 ; No-arg-parsing entry point
713: XOR ECX, ECX
715: JMP L5
717: L0: XOR EBX, EBX
719: JMP L4
71B: L1: MOV BYTE PTR [#x280001D4], 0
722: MOV BYTE PTR [#x280001BC], 4
729: MOV EAX, 8
72E: ADD EAX, [#x8069724] ; current_region_free_pointer
734: CMP EAX, [#x80696F8] ; current_region_end_addr
73A: JBE L2
73C: CALL #x80536F8 ; alloc_overflow_eax
741: L2: XCHG EAX, [#x8069724] ; current_region_free_pointer
747: LEA EAX, [EAX+3]
74A: MOV BYTE PTR [#x280001BC], 0
751: CMP BYTE PTR [#x280001D4], 0
758: JEQ L3
75A: BREAK 9 ; Pending interrupt trap
75C: L3: MOV [EAX-3], ECX
75F: MOV [EAX+1], EBX
762: ADD EBX, 4
765: L4: CMP EBX, 4000
76B: JL L1
76D: ADD ECX, 4
770: L5: CMP ECX, EDX
772: JL L0
774: MOV EDX, EAX
776: MOV ECX, [EBP-8]
779: MOV EAX, [EBP-4]
77C: ADD ECX, 2
77F: MOV ESP, EBP
781: MOV EBP, EAX
783: JMP ECX
785: NOP
786: NOP
787: NOP
In the above the BYTE PTR stuff
71B: L1: MOV BYTE PTR [#x280001D4], 0
722: MOV BYTE PTR [#x280001BC], 4
and
74A: MOV BYTE PTR [#x280001BC], 0
751: CMP BYTE PTR [#x280001D4], 0
758: JEQ L3
75A: BREAK 9 ; Pending interrupt trap
is interspersed interrupt handling code and can be ignored.
The inner and outer loops are pretty trivial, and the allocation stuff
is just the following:
729: MOV EAX, 8
72E: ADD EAX, [#x8069724] ; current_region_free_pointer
We want to allocate 8 bytes and calculate the new value of the free
pointer after allocating 8 bytes.
734: CMP EAX, [#x80696F8] ; current_region_end_addr
73A: JBE L2
73C: CALL #x80536F8 ; alloc_overflow_eax
We check if this was more than was available. If yes then we have to
call the overflow handler, which will start GC, and/or switch to
another region, etc. In the normal case we just fall through:
741: L2: XCHG EAX, [#x8069724] ; current_region_free_pointer
We write back the modified freepointer, and retrieve the old free
pointer, which is the pointer to our allocated cons cell, modulo
pointer tag handling done by:
747: LEA EAX, [EAX+3]
Done. After the interrupt handling code, we see just the
initialization of our cons (set car and cdr to i and j):
75C: L3: MOV [EAX-3], ECX
75F: MOV [EAX+1], EBX
And the rest of the inner and outer loops.
Running similar benchmarks on ACL we see that we can get as low as
1.2s/65 cycles total time (i.e. allocation + initialization + gc +
loop overhead) per cons, suggesting that that implementation's GC
default parameters are better suited for this scenario, in that the
infirmary GC scheme is fast enough to be called very often with little
overhead, and hence the infirmary generation fits into the 2nd level
cache, making memory accesses quite a bit faster.
Now back to your comparison with C++:
- Note first that on any modern CPU, any instruction that touches
memory is going to take many cycles, especially if it has to touch
normal memory (even on a 550 MHz machine, any instruction that
touches PC100 SDRAM will take more than 10ns, and that doesn't
include secondary effects from pipe-line stalls, etc.).
- The correct comparison to C/C++ would be to allocate and deallocate
the cons cell using malloc/free or new/delete respectively. On said
AMD, the C version of time-cons takes 5s for 10M conses, the C++
version takes 6.8s.
- If you allow optimizations based on the knowledge that the conses
that are allocated become garbage at once, then you can get faster
C/C++ performance, but you can also gain much faster performance
from the CL version, as seen by time-set-cons, which on CMU CL
performs as fast as the C variant of the same thing. Ditto for
manual pooling of allocation requests, etc.
- Note that normal allocation patterns have much more intricate
life-cycle patterns, thereby making trivial manual allocation
schemes impossible, yielding even higher differences in performance
between automatic and manual storage reclamation schemes[1].
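[The malloc/free comparison mentioned in the second point above might look like the following — a guess at the C analogue of TIME-CONS; the pair layout is invented, the loop bounds mirror the Lisp code:]

```c
#include <stdlib.h>

/* A sketch of TIME-CONS in C: allocate a pair, initialize it, and
 * free the previous one, mirroring the Lisp loop above. */
typedef struct { long car, cdr; } pair;

pair *time_cons(long n) {
    pair *res = NULL;
    for (long i = 0; i < n; i++) {
        for (long j = 0; j < 1000; j++) {
            pair *p = malloc(sizeof *p);
            p->car = i;
            p->cdr = j;
            free(res);          /* the previous pair becomes garbage */
            res = p;
        }
    }
    return res;
}
```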
All in all, this example can give some interesting food for thought
for those interested in the inner workings of GC and implementations,
though it teaches us little about real-world performance levels.
Regs, Pierre.
Footnotes:
[1] With thanks to Duane Rettig...
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
Pierre R. Mai <····@acm.org> wrote:
+---------------
| ...it is likely that the infirmary generation is larger than
| the 1st level cache.
+---------------
Do you really mean "infirmary" here? Doubtless this is a cross-language
difference in slang, but I'm curious as to the connotations of the word
"infirmary" in your native tongue. In English, an "infirmary" is a hospital
or clinic [a place to take care of sick people].
In the GC literature I've read, the slang term I've most often seen
used for a fixed-location ephemeral allocation area is "nursery" --
in English, a place where infant children (or plants) are cared for
(or grown), which they eventually leave (or are transplanted from),
never to return.
-Rob
p.s. In the U.S., the sequence of educational institutions is typically
thus: nursery school (or "day care", if there's no explicit educational
component) -> pre-school (a.k.a. kindergarten) -> grammar school -> middle
school (only in a few places) -> high school -> college -> "post-grad"
(after one's bachelor's degree) -> "post-doc" (after a PhD). [And then
"life", one hopes!]
-----
Rob Warnock, 31-2-510 <····@sgi.com>
SGI Network Engineering <URL:http://reality.sgi.com/rpw3/>
1600 Amphitheatre Pkwy. Phone: 650-933-1673
Mountain View, CA 94043 PP-ASEL-IA
[Note: Please don't use ·········@sgi.com or ········@sgi.com ]
····@rigden.engr.sgi.com (Rob Warnock) writes:
> Pierre R. Mai <····@acm.org> wrote:
> +---------------
> | ...it is likely that the infirmary generation is larger than
> | the 1st level cache.
> +---------------
>
> Do you really mean "infirmary" here? Doubtless this is a cross-language
No, of course not, I did mean nursery, as you point out. I plead the
late hour of the night, and certain strange language pathways that
sometimes spring up in my brain. That said, let's see how this might
have come about:
> difference in slang, but I'm curious as to the connotations of the word
> "infirmary" in your native tongue. In English, an "infirmary" is a hospital
> or clinic [a place to take care of sick people].
The picture that I might have had in mind, was that (especially in the
case at hand) most members of the nursery don't survive it, by virtue
of having little vitality left in them, i.e. of them being "infirm" in
some way. Mortality rates in nurseries luckily don't exceed 10% most
of the time, hence infirmary and not nursery somehow sprang to mind...
> In the GC literature I've read, the slang term I've most often seen
> used for a fixed-location ephemeral allocation area is "nursery" --
> in English, a place where infant children (or plants) are cared for
> (or grown), which they eventually leave (or are transplanted from),
> never to return.
I've always thought that this was in slight contradiction to the
conjecture underlying ephemeral or generational garbage collectors,
namely that of high infant mortality. Since this was the sole reason
that ephemeral garbage collectors were thought to be a win at first,
it would be counter-productive to put nurture and care into those
infants. That is of course in stark contrast to the nurseries in
real-life.
As it later turned out, the conjecture doesn't actually have to hold,
in order for ephemeral GC to be a win, though.
Regs, Pierre.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
Here's the result of a very simple-minded test I did. I tried a
similar function on LW, Corman Lisp, and Petite Chez Scheme.
(People who think that Scheme is not Lisp please skip the results of
Chez).
It seems that, if the results of this test extend in some measure to
real programs, LW has room for improvement in memory management.
NOTES:
- Corman Lisp did not like LOOP, so I replaced the LOOP expression
with DOTIMES.
- In all three systems, GET-INTERNAL-RUN-TIME in the CLs and RUN-TIME
in PCS (which is documented as being equivalent to CL's
GET-INTERNAL-RUN-TIME), apparently return _real_time_, or something
very close to it, instead of run time. Anyway, there were no other
processes running.
- None of the systems had enough smarts to dead-code away the computation
being timed.
- The systems were tested as they come out of the box, with no
particular tuning.
================
;;; CL
(declaim (optimize speed (safety 0) (debug 0)))
(defconstant +hz-cpu+ 701.6e6)
(defun straconsa (n l)
(let* ((inizio (get-internal-run-time))
(foo (loop repeat n
do (make-list l)))
(fine (get-internal-run-time))
(qcons (* n l))
(tempo (float (/ (- fine inizio) internal-time-units-per-second)))
(cons-per-sec (/ qcons tempo))
(cicli-per-cons (* tempo +hz-cpu+ (/ qcons))))
(declare (ignore foo))
(format t "~&cons: ~S~%" qcons)
(format t "~&tempo: ~S~%" tempo)
(format t "~&cons per sec.: ~S~%" cons-per-sec)
(format t "~&cicli per cons: ~S~%" cicli-per-cons)))
================
WARNING! Scheme code follows!
Those who can't stand the sight, please skip the next 24 lines.
================
;;; Chez Scheme
(define hz-cpu 701.6e6)
(define internal-time-units-per-second 1000.0)
(define (straconsa n l)
(let* ((inizio (cpu-time))
(foo (repeat n
(make-list l)))
(fine (cpu-time))
(qcons (* n l))
(tempo (/ (- fine inizio) internal-time-units-per-second))
(cons-per-sec (/ qcons tempo))
(cicli-per-cons (* tempo hz-cpu (/ qcons))))
(printf "cons: ~S~%" qcons)
(printf "tempo: ~S~%" tempo)
(printf "cons per sec.: ~S~%" cons-per-sec)
(printf "cicli per cons: ~S~%" cicli-per-cons)))
================
RESULTS
-------
Cycles per cons (min/max of a few runs)
LW
(straconsa 1000 1000) 266/343
(straconsa 1000 100000) 1472/1534
Corman
(straconsa 1000 1000) 69/79
(straconsa 1000 100000) 369/379
PCS
(straconsa 1000 1000) 35/42
(straconsa 1000 100000) 74/75
====
Athlon / 701.6 MHz(*) / Windows ME
Lispworks Personal Edition 4.1.20
Corman Lisp 1.42
Petite Chez Scheme Version 6.0a
(*) As reported by a utility called 'cpuz'. Nominally 700 MHz.
--
Pierpaolo Bernardi
"romeo bernardi" <········@tin.it> writes:
> It seems that, if the results of this test extend in some measure to
> real programs, LW has room for improvement in memory management.
Like I wrote in my own silly-benchmark-of-the-week article, results
from benchmarks like these are pretty unlikely to give any real
insight into real-world programs, for a number of reasons:
1. Real world programs don't behave like this (generating tons of conses
which are thrown away the moment they have been created and
initialized).
2. If they did, it would be trivial to optimize away this sort of waste
by resourcing the conses in question. See point 1.
3. Since GC performance is highly correlated with the tuning of the GC
parameters to the life-time patterns encountered, and given that
real world programs don't do the thing we are talking about (see
points 1 and 2), it seems highly likely that any serious
implementation isn't going to come tuned to this improbable pattern
out-of-the-box.
I have never looked at tuning GC parameters in LWL/LWW, hence I'll
leave speculations on better parameters to others.
Regs, Pierre.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
"Pierre R. Mai" <····@acm.org> ha scritto nel messaggio
···················@orion.bln.pmsf.de...
> "romeo bernardi" <········@tin.it> writes:
>
> > It seems that, if the results of this test extend in some measure to
> > real programs, LW has room for improvement in memory management.
>
> Like I wrote in my own silly-benchmark-of-the-week article, results
> from benchmarks like these are pretty unlikely to give any real
> insight into real-world programs, for a number of reasons:
Yes, yes, you are right. We both know that a CL implementation cannot
be evaluated by a two-line loop.
I just wanted to test whether lispnewbie's impression had
some foundation in reality. As it happens, this impression
was not without basis: at least in this artificial bench, LW
does not do well. Whether this has implications for real
use, clearly requires more research. 8-)
Cheers,
Pierpaolo
··········@questions.com writes:
> I have noticed that even though Lispworks compiled code is very fast in tight
> loops, it seems very slow when it involves memory allocation. It seems to
> take about 500 CPU cycles per cons cell, including whatever time that adds to
> garbage collection. In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions.
And then get a "free list corrupt" message in RPM that not even
rpm --rebuilddb
can fix?
I am really really serious here. This is the situation I am in right
now. My RPM db is messed up and all I am told I can do is to
reinstall RPM 3.xx, rebuild the database (meanwhile screwing up the
dependencies in other packages) and hope for the best.
All of this, because somebody sometime thought that it is "faster" to
go the C++ way.
It is true that I am using free software and therefore I shouldn't
complain, but I thought that this is a good cautionary tale about non
GCed languages :)
Cheers
--
Marco Antoniotti ========================================================
NYU Courant Bioinformatics Group tel. +1 - 212 - 998 3488
719 Broadway 12th Floor fax +1 - 212 - 995 4122
New York, NY 10003, USA http://bioinformatics.cat.nyu.edu
"Hello New York! We'll do what we can!"
Bill Murray in `Ghostbusters'.
>>>>> "lispnewbie" == lispnewbie <··········@questions.com> writes:
lispnewbie> garbage collection. In C++ I can allocate and free a
lispnewbie> memory cell by taking it off a "free list" and putting
lispnewbie> it back on, which adds up to just a few CPU
lispnewbie> instructions. I'm confused about why Lispworks takes
lispnewbie> so long. What is it
It's been a long time since I looked, but malloc and free aren't all
that cheap. Looking things up on a free list takes a lot of cycles,
and freeing does too. At least the fast malloc that comes with XEmacs
looks pretty expensive for malloc and free.
For comparison, on the sparc port of CMUCL, allocating a cons cell is
a handful of instructions to increment a pointer. GC is lot more,
though.
Ray
From: James Hague
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <3b1bdbf5_2@newsfeeds>
Raymond Toy:
>
> It's been a long time since I looked, but malloc and free aren't all
> that cheap. Looking things up on a free list takes a lot of cycles,
> and freeing does too. At least the fast malloc that comes with XEmacs
> looks pretty expensive for malloc and free.
malloc was traditionally coded with the assumptions that (a) most
dynamically allocated blocks will not be of a trivial size (i.e. allocating
blocks of 4 or 8 bytes is silly, as the header takes more space), and (b)
within reason, reducing fragmentation is more important than speed of
allocation (the underlying assumption being that calling malloc inside of
real-time loops is misguided).
With OOP, these characteristics changed. In a patch for Visual C++,
Microsoft incorporated a new memory manager that could improve allocation
speed by something like 40 times. But you can't assume that every
implementation of malloc is designed to be performance oriented.
James
"James Hague" <···········@volition-inc.com> writes:
...
> With OOP, these characteristics changed. In a patch for Visual C++,
> Microsoft incorporated a new memory manager that could improve allocation
> speed by something like 40 times. But you can't assume that every
> implementation of malloc is designed to be performance oriented.
Just recently we ported a piece of code from Solaris/Linux/IRIX to
Windows. The piece of code has worked flawlessly for years under these
UN*X environments.
Then we ported it to Windows and the system started behaving badly in a
random way. We were compiling the piece of code under VC++. After a
couple of hours of head scratching, we finally noticed that what
should have been a null pointer was not null. It turned out that one
allocator function (note that we are working in C) was doing something
like
zut* z;
z = (zut*) malloc(sizeof(zut) * N);
/* happily use 'z' hereafter. */
Now the question. Why doesn't VC++ 'malloc' execute the missing
'memset 0' we had to manually insert everywhere?
Or, why in the world should I worry about these things if I am not
writing a device driver for a network-enabled espresso machine? :)
lazily yours.
--
Marco Antoniotti ========================================================
NYU Courant Bioinformatics Group tel. +1 - 212 - 998 3488
719 Broadway 12th Floor fax +1 - 212 - 995 4122
New York, NY 10003, USA http://bioinformatics.cat.nyu.edu
"Hello New York! We'll do what we can!"
Bill Murray in `Ghostbusters'.
>>>>> "Marco" == Marco Antoniotti <·······@cs.nyu.edu> writes:
Marco> Just recently we ported a piece of code from Solaris/Linux/IRIX to
Marco> Windows. The piece of code has worked flawlessly for years under these
Marco> UN*X environments.
Marco> Then we ported it to Windows and the system started behaving badly in a
Marco> random way. We were compiling the piece of code under VC++. After a
Marco> couple of hours of head scratching, we finally noticed that what
Marco> should have been a null pointer was not null. It turned out that one
Marco> allocator function (note that we are working in C) was doing something
Marco> like
Marco> zut* z;
Marco> z = (zut*) malloc(sizeof(zut) * N);
Marco> /* happily use 'z' hereafter. */
Marco> Now the question. Why doesn't VC++ 'malloc' execute the missing
Marco> 'memset 0' we had to manually insert everywhere?
But nowhere does the C standard say a NULL pointer is really 0. C++
has some other rule that says 0 is equivalent to NULL, but doesn't
actually say NULL is 0. (I think. I don't do C++ much anymore.)
(Or perhaps, I don't follow your example and comments.)
Ray
Raymond Toy <···@rtp.ericsson.se> writes:
> >>>>> "Marco" == Marco Antoniotti <·······@cs.nyu.edu> writes:
>
> ...
> But nowhere does the C standard say a NULL pointer is really 0. C++
> has some other rule that says 0 is equivalent to NULL, but doesn't
> actually say NULL is 0. (I think. I don't do C++ much anymore.)
>
> (Or perhaps, I don't follow your example and comments.)
You are right. It is just a rant of mine against all the pitfalls you
have to be aware of when the language itself does not help you.
Cheers
"Marco Antoniotti" <·······@cs.nyu.edu> wrote in message ····················@octagon.mrl.nyu.edu...
> ...
> You are right. It is just a rant of mine against all the pitfalls you
> have to be aware of when the language itself does not help you.
>
I'd prefer to think of it as a rant against programmers relying on
implementation-dependent behaviour. Just because xyz's lisp does
-(DEFVAR X '(1 2 3 4))
X
-(NREVERSE X)
(4 3 2 1)
-X
(4 3 2 1)
doesn't mean I should skip the SETF.
Geoff
Raymond Toy <···@rtp.ericsson.se> writes:
> But nowhere does the C standard say a NULL pointer is really 0. C++
Yes it does. See sections 6.3.2.3 and 7.17 of ISO/IEC 9899:1999. A
null pointer constant is:
An integer constant expression with the value 0, or
such an expression cast to type void *, is called a null
pointer constant.46) If a null pointer constant is assigned
to or compared for equality to a pointer, the constant is
converted to a pointer of that type. Such a pointer, called
a null pointer, is guaranteed to compare unequal to a
pointer to any object or function.
I.e. the following code fragment portably tests whether z is a null
pointer:
if (z == 0)
stop_the_world();
Furthermore it is certain that the following expression is always
true:
NULL == 0
The thing that is left open is the bit-pattern(s) that represent
null pointers at the memory level, which is relevant in this case to
some degree.
Regs, Pierre.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
In article <··············@orion.bln.pmsf.de>, Pierre R. Mai wrote:
>Raymond Toy <···@rtp.ericsson.se> writes:
>
>> But nowhere does the C standard say a NULL pointer is really 0. C++
>
>Yes it does. See sections 6.3.2.3 and 7.17 of ISO/IEC 9899:1999. A
>null pointer constant is:
i don't have the standard at hand, but i think the implication of those
sections is that "if the literal '0' is used in a pointer context it is
equivalent to the NULL pointer". afaict, the main reason for the macro
NULL (usually implemented as a cast of 0 to a suitable pointer type) was
to provide for cases where the bit pattern for the null pointer is not
the machine implementation for the literal 0.
> ...
--
hs
----------------------------------------------------------------
"The cheapest pride is national pride. It demonstrates the lack of
characteristics and achievements you can be proud of. The worst loser
can have national pride" - Schopenhauer
··@paradise.nirvananet (Hartmann Schaffer) writes:
> In article <··············@orion.bln.pmsf.de>, Pierre R. Mai wrote:
> >Raymond Toy <···@rtp.ericsson.se> writes:
> >
> >> But nowhere does the C standard say a NULL pointer is really 0. C++
> >
> >Yes it does. See sections 6.3.2.3 and 7.17 of ISO/IEC 9899:1999. A
> >null pointer constant is:
>
> i don't have the standard at hand, but i think the implication of those
> sections is that "if the literal '0' is used in a pointer context it is
> equivalent to the NULL pointer". afaict, the main reason for the macro
> NULL (usually implemented as a cast of 0 to a suitable pointer type) was
> to provide for cases where the bit pattern for the null pointer is not
> the machine implementation for the literal 0.
Just to close this whole thread out: Please re-read my postings and
the standard. NULL is _always_ defined as any kind of null pointer
constant (again see 7.17). NULL is defined just for aesthetic
reasons. This whole issue is completely separate from the
"low-level" representation of null pointers, which can be any kind of
bit pattern, and can vary from pointer type to pointer type.
It was this issue of NULL vs. 0 that I was commenting on, not the
issue of "low-level" representation, which Raymond rightly pointed
out. In fact I might have quoted a bit too concisely, thereby
misleading others into thinking I was speaking about the low-level
representation of null pointers.
For further information refer to the relevant standards. The
meta-reason why I repeatedly brought up these unrelated language
issues is that I feel that if we discuss C, we should do it with the
same amount of accuracy that we afford CL.
Regs, Pierre.
In article <··············@orion.bln.pmsf.de>,
Pierre R. Mai <····@acm.org> wrote:
>Raymond Toy <···@rtp.ericsson.se> writes:
>
>> But nowhere does the C standard say a NULL pointer is really 0. C++
>
>Yes it does. See sections 6.3.2.3 and 7.17 of ISO/IEC 9899:1999. A
>null pointer constant is:
>
> An integer constant expression with the value 0, or
That's a null pointer constant, not a null pointer. He's talking about the
representation, not the syntax.
Assuming the following declarations:
int *p;
int i = 0;
You can assign a null pointer to p with:
p = 0;
But neither:
p = i;
nor
memset(&p, 0, sizeof p);
are guaranteed to assign a null pointer to p. It will work on most
architectures, mainly because the implementors chose to make the
representation mirror the syntax (perhaps because there was lots of old
code that made this assumption), but it's not required by the language.
Let's end this debate now. This isn't comp.{lang,std}.c, and the issue
gets debated in those groups more than enough. We don't need to have it
here.
--
Barry Margolin, ······@genuity.net
Genuity, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: George Neuner
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <3b1d1317.500203174@helice>
On 04 Jun 2001 16:34:44 -0400, Raymond Toy <···@rtp.ericsson.se>
wrote:
>But nowhere does the C standard say a NULL pointer is really 0. C++
>has some other rule that says 0 is equivalent to NULL, but doesn't
>actually say NULL is 0. (I think. I don't do C++ much anymore.)
All the standard says is that NULL is a pointer value distinguishable
from all valid pointers. The value zero just happens to be convenient
on most machines.
George
Raymond Toy <···@rtp.ericsson.se> writes:
> Marco> Now the question. Why doesn't VC++ 'malloc' execute the missing
> Marco> 'memset 0' we had to manually insert everywhere?
>
> But nowhere does the C standard say a NULL pointer is really 0.
The `memset' Marco is talking about has nothing to do with NULL
pointers, really. From what I gather, the source (wrongly) assumed
that all the memory returned by malloc will be cleared with zeros.
As far as C goes, it is true that a NULL pointer need not have a value
of zero, but it can be represented by `0' in a pointer context.
Therefore:
char *foo = 0;
/* `foo' is now a NULL pointer, and comparing it to NULL or to 0
will yield true. However, that doesn't mean the object
representation of `foo' is an all-zero bit pattern. */
The C definition of NULL pointer is pretty weird, in fact so weird
that few C programmers understand it. The C FAQ dedicates a whole
chapter to the topic.
For instance, under a strict C definition of NULL pointer, the
standard practice of calling memset(struct_ptr, 0, sizeof(*struct_ptr))
does not guarantee that the pointer members of STRUCT_PTR will be
initialized to NULL!
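That caveat can be made concrete with a small (hypothetical) struct. The portable way to null out pointer members is explicit assignment; the memset version merely happens to work on machines where null pointers are all-bits-zero:

```c
#include <stddef.h>
#include <string.h>

/* A hypothetical struct with pointer members. */
struct node {
    struct node *left, *right;
    int value;
};

/* Portable: assigning a null pointer constant to each pointer
   member is guaranteed to produce null pointers. */
static void node_init(struct node *n) {
    n->left = NULL;
    n->right = NULL;
    n->value = 0;
}

/* Not portable for the pointer members: all-bits-zero is not
   required to be a null pointer, although it is on essentially
   every common architecture. */
static void node_init_memset(struct node *n) {
    memset(n, 0, sizeof *n);
}
```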
In article <···············@octagon.mrl.nyu.edu>,
Marco Antoniotti <·······@cs.nyu.edu> wrote:
>Now the question. Why doesn't VC++ 'malloc' execute the missing
>'memset 0' we had to manually insert everywhere?
Because malloc() just does memory allocation, not initialization. If you
want the memory to be zeroed, use calloc() instead of malloc().
Note, however, that memset 0 on a pointer has undefined consequences, as
the other responder mentioned. If you want to null out a pointer, you have
to assign a null pointer constant to it, not use memset() or calloc().
Barry Margolin <······@genuity.net> writes:
> In article <···············@octagon.mrl.nyu.edu>,
> Marco Antoniotti <·······@cs.nyu.edu> wrote:
> >Now the question. Why doesn't VC++ 'malloc' execute the missing
> >'memset 0' we had to manually insert everywhere?
>
> Because malloc() just does memory allocation, not initialization. If you
> want the memory to be zeroed, use calloc() instead of malloc().
>
> Note, however, that memset 0 on a pointer has undefined consequences, as
> the other responder mentioned. If you want to null out a pointer, you have
> to assign a null pointer constant to it, not use memset() or calloc().
Very good point indeed. Ranting apart, this was a very good piece of
advice.
Cheers
Marco Antoniotti <·······@cs.nyu.edu> writes:
> zut* z;
> z = (zut*) malloc(sizeof(zut) * N);
> /* happily use 'z' hereafter. */
>
> Now the question. Why doesn't VC++ 'malloc' execute the missing
> 'memset 0' we had to manually insert everywhere?
Because zero initializing memory is not part of malloc's contract (see
section 7.20.3.3 of ISO/IEC 9899:1999). If you want all bits zero
initialized memory, you need to use calloc (section 7.20.3.1).
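The difference in the two contracts can be sketched as follows; the helper names are mine, not from any library:

```c
#include <stdlib.h>

/* calloc (C99 7.20.3.1): every byte of the block is zero on return.
   malloc (C99 7.20.3.3): the contents are indeterminate, so the
   caller must write each element before reading it. */
static int *alloc_ints_zeroed(size_t n) {
    return calloc(n, sizeof(int));
}

static int *alloc_ints_uninit(size_t n) {
    return malloc(n * sizeof(int));  /* contents indeterminate */
}
```

(And, per the caveat elsewhere in this thread, even calloc's all-bits-zero guarantee does not portably make *pointer* members null; it only guarantees zero for integer types.)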
This is not different from ANSI CL, where e.g. make-array is not
required to initialize the elements of the string it allocates, unless
you specify an initial-element or initial-contents.
> Or, why in the world should I worry about these things if I am not
> writing a device driver for a network-enabled espresso machine? :)
While I agree with you that C requires users to worry about too many
details that the system should worry about instead, I don't think that
in this particular instance C is to blame for the defect, IMHO.
Regs, Pierre.
From: George Neuner
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <3b1d15a7.500859357@helice>
On 04 Jun 2001 15:56:39 -0400, Marco Antoniotti <·······@cs.nyu.edu>
wrote:
>Now the question. Why doesn't VC++ 'malloc' execute the missing
>'memset 0' we had to manually insert everywhere?
Because malloc() is a primitive function upon which you build more
complicated things ... such as allocators that automagically zero
memory.
>Or, why in the world should I worry about these things if I am not
>writing a device driver for a network-enabled espresso machine? :)
If you are about to fill the memory with something else, clearing it
was a waste of time. 'C' is a systems language - it is about
efficiency, not protecting you from yourself.
George
From: Tim Bradshaw
Subject: Re: Speed of memory management in Lispworks
Date:
Message-ID: <nkjn17okywt.fsf@tfeb.org>
··········@questions.com writes:
> I have noticed that even though Lispworks compiled code is very fast in tight
> loops, it seems very slow when it involves memory allocation. It seems to
> take about 500 CPU cycles per cons cell, including whatever time that adds to
> garbage collection. In C++ I can allocate and free a memory cell by taking
> it off a "free list" and putting it back on, which adds up to just a few CPU
> instructions. I'm confused about why Lispworks takes so long. What is it
> doing? Is it the garbage collection that takes most of the time?
This is a kind of belated followup, sorry.
In general a good modern GC can be made to be very fast indeed if
you're willing to tune it. Although the kind of `application' you
have is pretty non-typical, it can be made very fast in a copying
generational GC by making the first generation very large (if you have
enough memory). This causes the GC to run very infrequently, and to
do basically no work when it runs (since everything is garbage).
The typical allocation mechanism for such a GC is increment a pointer,
check a bound and initialize. This is almost definitively fast - it's
really hard to see how any system can allocate much faster than this.
There are actually a couple of tricks - you may be able to avoid the
bounds check in some cases if you know the allocation size; you may be
able to stack allocate (which can avoid the bounds check and also be
more cache-friendly perhaps); and finally you may be able to avoid
initialising for things like arrays of non-pointer types. Many Lisp
systems can do some or all of these things. I have persuaded Lisp
systems to allocate at a pretty high fraction of memory bandwidth
without very much tuning.
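The pointer-bump fast path described above can be sketched in C. `nursery_init` and `gc_alloc` are illustrative names only; a real collector would also round sizes up to the required alignment and trigger a collection on the slow path rather than failing:

```c
#include <stddef.h>
#include <stdlib.h>

/* A toy first generation ("nursery").  In a real system alloc_ptr
   and alloc_limit would typically live in registers. */
static char *alloc_ptr;
static char *alloc_limit;

static void nursery_init(size_t bytes) {
    alloc_ptr = malloc(bytes);
    alloc_limit = alloc_ptr + bytes;
}

/* Fast path: check a bound and bump a pointer. */
static void *gc_alloc(size_t bytes) {
    if (bytes > (size_t)(alloc_limit - alloc_ptr))
        return NULL;            /* slow path: collect, then retry */
    void *obj = alloc_ptr;
    alloc_ptr += bytes;
    return obj;
}
```

Initialization of the new object (the third step mentioned above) happens after `gc_alloc` returns, which is why skipping it for non-pointer arrays is a worthwhile optimization.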
There was a thread where this was discussed a while ago (late last
year?). If you look for something like `allocation' and either my
name or Duane Rettig's in deja (erm, google?) you will probably find
it.
In the specific case of LW I don't know enough about the GC to know
what the issues are. However you can get dramatically better
performance by testing with something like (make-array n :element-type
'fixnum), so I suspect what you may be seeing is that MAKE-LIST is not
really optimized, and alternatively if you loop doing CONS each time
you tend to drown in loop overhead. Unfortunately the machine I have
to hand is a laptop, and like many laptops it has a funny (slow)
memory system I think, so I don't really trust the numbers I get from it.
--tim