From: Rob Thorpe
Subject: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125589093.922818.149400@g14g2000cwa.googlegroups.com>
A while ago Jon Harrop showed a small raytracer in OCaml.  Several
people ported it to lisp, and a huge meandering thread ensued.  I wrote
a lisp version; although it was short, it was very slow.  Of the
versions that were posted, Nathan Baum's was the fastest (on my machine
anyway).  It's posted here:-

http://groups.google.co.uk/group/comp.lang.lisp/msg/1a2aa1a6aff6bf29

After looking at the code of the raytracer I noticed that most of the
work is done by three functions: "dot", "v-" and "ray_sphere".

Dot is:
(defun dot (a b)
  (+ (* (x a) (x b)) (* (y a) (y b)) (* (z a) (z b))))

where (x a) etc are defstruct accesses into the defstruct:
(defstruct (vec (:conc-name nil) (:constructor vec (x y z)))
  (x 0.0 :type single-float)
  (y 0.0 :type single-float)
  (z 0.0 :type single-float))

It is called more than any other function, many tens of thousands of
times even for a small raytracing.  The other functions are also quite
simple.


This lisp version is slower than many versions in other languages, but
the biggest difference is in memory use.  Nathan's lisp version seems
to cons a very large amount of memory (1GB when producing a 160x160
pixel raytracing on CMUCL).  My version also conses loads; in both
cases the real memory usage is low and GC recycles 99%.  It seems that
the functions mentioned above cause this consing.  (It's hard to
measure since enabling profiling causes even more consing.)
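
For reference, the TIME macro on both CMUCL and SBCL reports total
bytes consed without the profiler's overhead, though it doesn't
attribute the consing to particular functions.  MAIN below is a
stand-in for whatever the tracer's entry point is called:

(time (main 6 "out.pgm" 160))
;; => prints the run time plus "... bytes consed"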

Does anyone know why these simple functions should cause so much
consing?

From: Nathan Baum
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125606150.839198.60200@g47g2000cwa.googlegroups.com>
Rob Thorpe wrote:
> A while ago Jon Harrop showed a small raytracer in OCaml.  Several
> people ported it to lisp, and a huge meandering thread ensued.  I wrote
> a lisp version; although it was short, it was very slow.  Of the
> versions that were posted, Nathan Baum's was the fastest (on my machine
> anyway).  It's posted here:-
>
> http://groups.google.co.uk/group/comp.lang.lisp/msg/1a2aa1a6aff6bf29

Disclaimer: This is terrible code. I should be shot. Caveat emptor.

> This lisp version is slower than many versions in other languages, but
> the biggest difference is in memory use.  Nathan's lisp version seems
> to cons a very large amount of memory (1GB when producing a 160x160
> pixel raytracing on CMUCL).  My version also conses loads; in both
> cases the real memory usage is low and GC recycles 99%.  It seems that
> the functions mentioned above cause this consing.  (It's hard to
> measure since enabling profiling causes even more consing.)

On SBCL, it takes 800MB without profiling turned on.

> Does anyone know why these simple functions should cause so much
> consing?

As others have surmised, the return values get boxed. Inlining doesn't
help (not with SBCL, at least).

I achieved a considerable speed increase (but not quite an order of
magnitude) by taking the ((v (v- (center sphere) (orig ray)))) binding
out of ray-sphere and performing the vector subtraction inline upon a
global variable. Not pretty, but it works.
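
Roughly, the shape of the change (a sketch rather than the actual
diff; the name v-into! is made up):

;; Destructively update one preallocated vec instead of letting v-
;; cons a fresh structure on every call.
(defvar *scratch* (vec 0.0 0.0 0.0))

(declaim (inline v-into!))
(defun v-into! (dst a b)
  ;; like v-, but writes the result into DST and returns it
  (setf (x dst) (- (x a) (x b))
        (y dst) (- (y a) (y b))
        (z dst) (- (z a) (z b)))
  dst)

;; so (v- (center sphere) (orig ray)) inside ray-sphere becomes
;; (v-into! *scratch* (center sphere) (orig ray))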
From: Paul Dietz
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <df7jb2$qa9$1@avnika.corp.mot.com>
Rob Thorpe wrote:

> Does anyone know why these simple functions should cause so much
> consing?

The results are boxed?

	Paul
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndhefee.3i4.jsnell@sbz-30.cs.Helsinki.FI>
<·············@antenova.com> wrote:
> This lisp version is slower than many versions in other languages, but
> the biggest difference is in memory use.  Nathan's lisp version seems
> to cons a very large amount of memory (1GB when producing a 160x160
> pixel raytracing on CMUCL).  My version also conses loads; in both
> cases the real memory usage is low and GC recycles 99%. It seems that
> the functions mentioned above cause this consing.

No. The consing is coming from creating structures for the return
values of unitize, v+, v- and v*. Most of these structures are temporary,
so a sufficiently smart compiler could optimize them away, or at least
allocate them on the stack. Unfortunately CMUCL and SBCL aren't smart
enough. To get rid of the consing your best bet is probably optimizing
the structure allocations away by hand, like in
<http://www.cs.helsinki.fi/u/jesnellm/tmp/ray-mv.lisp>.
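
The rough idea, sketched (this shows the shape of the transformation,
it is not an excerpt from ray-mv.lisp):

;; Pass and return vector components as scalars/multiple values, so
;; no vec structure is allocated for the temporaries.
(declaim (inline v-mv))
(defun v-mv (ax ay az bx by bz)
  (declare (single-float ax ay az bx by bz))
  (values (- ax bx) (- ay by) (- az bz)))

;; A call site that used to read
;;   (let ((v (v- (center sphere) (orig ray)))) ...)
;; becomes
;;   (multiple-value-bind (vx vy vz)
;;       (v-mv cx cy cz ox oy oz)
;;     ...)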

Tested only on SBCL, where it conses very little (<10MB for a 160x160
image) and is an order of magnitude faster than the version by Nathan
Baum. The performance characteristics of different CL implementations
are so different that I won't make any promises about it being fast on
other implementations.

> (It's hard to
> measure since enabling profiling causes even more consing.)

For a low-overhead profiling solution, have a look at the sampling
profiler. It's included with SBCL as sb-sprof; for CMUCL I think
you'll need to fetch it from the cmucl-imp mailing list archives. I'm
not sure how well the CMUCL version works. It measures time spent in
various functions, not consing, but trying to optimize for low consing
instead of low execution time seems rather pointless.
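
Typical usage looks something like this (option names may vary a
little between versions; MAIN again stands in for the entry point):

(require :sb-sprof)

;; Statistical profiler: samples the program counter periodically,
;; so it adds almost no consing of its own.
(sb-sprof:with-profiling (:max-samples 10000 :report :flat)
  (main 6 "out.pgm" 160))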

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125648440.841702.324630@z14g2000cwz.googlegroups.com>
Juho Snellman wrote:
> <·············@antenova.com> wrote:
> > This lisp version is slower than many versions in other languages, but
> > the biggest difference is in memory use.  Nathan's lisp version seems
> > to cons a very large amount of memory (1GB when producing a 160x160
> > pixel raytracing on CMUCL).  My version also conses loads; in both
> > cases the real memory usage is low and GC recycles 99%. It seems that
> > the functions mentioned above cause this consing.
>
> No. The consing is coming from creating structures for the return
> values of unitize, v+, v- and v*. Most of these structures are temporary,
> so a sufficiently smart compiler could optimize them away, or at least
> allocate them on the stack. Unfortunately CMUCL and SBCL aren't smart
> enough.

Ah, that's the problem, that makes sense.

> To get rid of the consing your best bet is probably optimizing
> the structure allocations away by hand, like in
> <http://www.cs.helsinki.fi/u/jesnellm/tmp/ray-mv.lisp>.
>
> Tested only on SBCL, where it conses very little (<10MB for a 160x160
> image) and is an order of magnitude faster than the version by Nathan
> Baum. The performance characteristics of different CL implementations
> are so different that I won't make any promises about it being fast on
> other implementations.

That's much better, it's a pity it requires such fiddling.

> > (It's hard to
> > measure since enabling profiling causes even more consing.)
>
> For a low-overhead profiling solution, have a look at the sampling
> profiler. It's included with SBCL as sb-sprof, for CMUCL I think
> you'll need to fetch it from the cmucl-imp mailing list archives. I'm
> not sure how well the CMUCL version works. It measures time spent in
> various functions, not consing, but trying to optimize for low consing
> instead of low execution time seems rather pointless.

I knew there was a better way to profile it I just didn't know what it
was, thanks.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <43186f78$0$97114$ed2619ec@ptn-nntp-reader03.plus.net>
Rob Thorpe wrote:
> Juho Snellman wrote:
>> No. The consing is coming from creating structures for the return
>> values of unitize, v+, v- and v*. Most of these structures are temporary,
>> so a sufficiently smart compiler could optimize them away, or at least
>> allocate them on the stack. Unfortunately CMUCL and SBCL aren't smart
>> enough.
> 
> Ah, that's the problem, that makes sense.

I'm not sure that is the whole picture though. In comparison, I don't
believe ocamlopt will optimise the temporary allocation away and (with no
inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
that inlining is one possible solution but a better solution is to have a
GC that is much faster at allocating and immediately collecting
temporaries. So my guess is that SBCL's GC needs some work.

Interestingly, Sun's Java has the same problem as SBCL here - it is very
slow at allocating and immediately collecting millions of temporary
vectors, effectively using malloc and free. The fix for the Java code
(manually inlining all vector calculations) is even more hideous than the
fix for the CL code.

Incidentally, Juho's implementation is the fastest I received (only ~20%
slower but >3x as long) and Nathan's code is the second fastest (~2x slower
and ~2x as long).

Lispers may be interested to know that the author of the Scheme compiler
Stalin just published results in comp.lang.scheme showing that Stalin can
generate faster code than all of the other languages and, IMHO, in a
reasonably fair test. However, compile times are three orders of magnitude
slower for Stalin than ocamlopt.

Finally, presumably the vector calculations can be expanded at compile time
using metaprogramming techniques similar to those performed by the C++
libraries Blitz++ and Boost?

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125685547.050784.15500@f14g2000cwb.googlegroups.com>
Jon Harrop wrote:
> Rob Thorpe wrote:
> > Juho Snellman wrote:
> >> No. The consing is coming from creating structures for the return
> >> values of unitize, v+, v- and v*. Most of these structures are temporary,
> >> so a sufficiently smart compiler could optimize them away, or at least
> >> allocate them on the stack. Unfortunately CMUCL and SBCL aren't smart
> >> enough.
> >
> > Ah, that's the problem, that makes sense.
>
> I'm not sure that is the whole picture though. In comparison, I don't
> believe ocamlopt will optimise the temporary allocation away and (with no
> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
> that inlining is one possible solution but a better solution is to have a
> GC that is much faster at allocating and immediately collecting
> temporaries. So my guess is that SBCL's GC needs some work.

I'm not sure about that conclusion.  It wouldn't surprise me if there
are more lisp temporaries, and/or they're bigger.  Can you make
ocamlopt show the amount of memory it allocates overall?

> Interestingly, Sun's Java has the same problem as SBCL here - it is very
> slow at allocating and immediately collecting millions of temporary
> vectors, effectively using malloc and free.

That could be the problem, many GCs are bad at that.

> The fix for the Java code
> (manually inlining all vector calculations) is even more hideous than the
> fix for the CL code.

Although Juho's fix isn't that general, I think you could write a
quite general fix for this, i.e. a whole different set of array
operations that would work better in this regard.  They could be a bit
odd though.

> Incidentally, Juho's implementation is the fastest I received (only ~20%
> slower but >3x as long) and Nathan's code is the second fastest (~2x slower
> and ~2x as long).
>
> Lispers may be interested to know that the author of the Scheme compiler
> Stalin just published results in comp.lang.scheme showing that Stalin can
> generate faster code than all of the other languages and, IMHO, in a
> reasonably fair test.

I read that; it's very impressive.  I think it solves the problem above
by unboxing.

> However, compile times are three orders of magnitude
> slower for Stalin than ocamlopt.

Wow, that's a lot of tea breaks.

> Finally, presumably the vector calculations can be expanded at compile time
> using metaprogramming techniques similar to those performed by the C++
> libraries Blitz++ and Boost?

Yes.  I think you could do the things you can do with those libraries,
but I don't know them so I can't really say.  You can certainly produce
specialised sets of functions if you need them.  Maybe someone else can
comment.

The compiler is available at runtime in lisp, so you can create special
data structures and then compile functions for them if you need to.
I've never done this, though; it's probably only necessary once in a
blue moon.
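
A toy example of that last point (illustrative only):

;; Build a dot product specialised to a given list of slot accessors
;; at runtime, then hand the lambda to COMPILE for native code.
(defun make-dot (accessors)
  (compile nil
           `(lambda (a b)
              (+ ,@(mapcar (lambda (acc)
                             `(* (,acc a) (,acc b)))
                           accessors)))))

;; (funcall (make-dot '(x y z)) (vec 1.0 2.0 3.0) (vec 4.0 5.0 6.0))
;; => 32.0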
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndhh5bt.6nd.jsnell@sbz-30.cs.Helsinki.FI>
<······@jdh30.plus.com> wrote:
> I'm not sure that is the whole picture though. In comparison, I don't
> believe ocamlopt will optimise the temporary allocation away and (with no
> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
> that inlining is one possible solution but a better solution is to have a
> GC that is much faster at allocating and immediately collecting
> temporaries. 

The cost of the temporaries isn't just allocation and GC. There's also
the cost of moving the floats between memory and registers/x87 stack. 
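
You can see this directly with DISASSEMBLE: the allocation of the
boxed float return value shows up in the compiled code for DOT.

(disassemble 'dot)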

> So my guess is that SBCL's GC needs some work.

Some of that work has been done during the summer, but not merged into
the main branch. I'm hoping to do that for 0.9.6, to be released in
about two months. Anyone wanting to play around with the changes, feel
free to check out the gencgc-page-table-branch from SBCL CVS (for
instructions see <http://sourceforge.net/cvs/?group_id=1373>). It's
about 20% faster than vanilla SBCL for the consy CL version of the
raytracer on x86-64.

> Incidentally, Juho's implementation is the fastest I received (only ~20%
> slower but >3x as long) and Nathan's code is the second fastest (~2x slower
> and ~2x as long).

It's not 3x as long by any reasonable metric. Lines of code aren't
reasonable; I could've mashed everything on one line and removed all
blank lines like in the C++ and Ocaml versions, but then it would've
also been as unreadable as the Ocaml and C++. For example it's got
about 15% more whitespace-separated tokens than the Ocaml code.

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Vesa Karvonen
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <dfa6l6$eaj$1@oravannahka.helsinki.fi>
Juho Snellman <······@iki.fi> wrote:
> <······@jdh30.plus.com> wrote:
[...]
> > Incidentally, Juho's implementation is the fastest I received (only ~20%
> > slower but >3x as long) and Nathan's code is the second fastest (~2x slower
> > and ~2x as long).

> It's not 3x as long by any reasonable metric. Lines of code aren't
> reasonable; I could've mashed everything on one line and removed all
> blank lines like in the C++ and Ocaml versions, but then it would've
> also been as unreadable as the Ocaml and C++. For example it's got
> about 15% more whitespace-separated tokens than the Ocaml code.

Don't waste your time. All this has been explained to Jon several times
already. (Try googling for "Jon Harrop" and "verbose":
  http://groups.google.fi/groups?hl=fi&q=jon+harrop+verbose .)

-Vesa Karvonen
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125688723.572144.72300@g14g2000cwa.googlegroups.com>
Vesa Karvonen wrote:
> Juho Snellman <······@iki.fi> wrote:
> > <······@jdh30.plus.com> wrote:
> [...]
> > > Incidentally, Juho's implementation is the fastest I received (only ~20%
> > > slower but >3x as long) and Nathan's code is the second fastest (~2x slower
> > > and ~2x as long).
>
> > It's not 3x as long by any reasonable metric. Lines of code aren't
> > reasonable; I could've mashed everything on one line and removed all
> > blank lines like in the C++ and Ocaml versions, but then it would've
> > also been as unreadable as the Ocaml and C++. For example it's got
> > about 15% more whitespace-separated tokens than the Ocaml code.
>
> Don't waste your time. All this has been explained to Jon several times
> already. (Try googling for "Jon Harrop" and "verbose":
>   http://groups.google.fi/groups?hl=fi&q=jon+harrop+verbose .)

I don't think Jon is actually measuring the whitespace lines, just the
code lines.

I think that, although some of the short lisp/scheme programs are
arguably of similar verbosity to the OCaml program, Juho's code above
is much more verbose.  All the stuff at the beginning is significant
extra meaning, not just extra text.  The use of nested quasiquotation
is particularly difficult - to me at least.
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndhhjtn.9he.jsnell@sbz-30.cs.Helsinki.FI>
<·············@antenova.com> wrote:
> I don't think Jon is actually measuring the whitespace lines, just the
> code lines.

Then I don't know where Jon is getting >3x from. Unless he's comparing
the CL program to the simplest ML program when talking of size, and 
the most optimized one when talking of speed.

> I think although some of the short lisp/scheme programs are arguably of
> similar verbosity to the OCaml program Juho's code above is much more
> verbose.  All the stuff at the beginning is significant extra meaning,
> not just extra text.

Oh, sure. It was not written for direct comparisons against the
programs Jon has on his site. It was basically a fairly mechanical 
transformation of what Nathan wrote, to see what an SSC (sufficiently
smart compiler) could do with the code.

If you wanted to actually compare a CL program to Jon's programs, you'd
need to implement all the optimizations that are implemented for the
other languages, instead of just using the basic version of the tracer.
For example, implementing the first optimization (better bounding) as
in <http://www.cs.helsinki.fi/u/jesnellm/tmp/ray-mv-2.lisp> is a 25%
speedup, after which the CL version runs faster than the most optimized
Ocaml version on my x86-64. (I don't care enough to implement the other
optimizations; this one was so easy that it was worth implementing just
to illustrate the point.)

And yes, comparing that version to Jon's programs is still pointless
because it is also significantly more verbose (by which I mean 50%
more, not 200%). It might be interesting to see how this sort of manual
unboxing would be implemented in the other languages; in CL it's easy
to implement with multiple-values, easy to hide the m-v ugliness behind
macros, and reasonably easy to generate those convenience macros using
other macros.
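
For instance, the ugliness can hide behind something as small as this
(a sketch of the shape, not the actual macros from ray-mv.lisp):

;; Bind the three components returned as multiple values, so call
;; sites read almost like the consing version.
(defmacro with-vec ((x y z) form &body body)
  `(multiple-value-bind (,x ,y ,z) ,form
     ,@body))

;; (with-vec (vx vy vz) (some-v-op ...)  ; any op returning 3 values
;;   (+ (* vx vx) (* vy vy) (* vz vz)))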

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a2ec1$0$22920$ed2619ec@ptn-nntp-reader01.plus.net>
Juho Snellman wrote:
> <·············@antenova.com> wrote:
>> I don't think Jon is actually measuring the whitespace lines, just the
>> code lines.
> 
> Then I don't know where Jon is getting >3x from. Unless he's comparing
> the CL program to the simplest ML program when talking of size, and
> the most optimized one when talking of speed.

Until now, I have been comparing the CL to the first (slowest and shortest)
implementations in the other languages only. Although the Lisp is more
optimised, this comparison is justified by the similar running times.

>> I think although some of the short lisp/scheme programs are arguably of
>> similar verbosity to the OCaml program

Comparing the shortest Lisp program I have (one of Nathan's) to the first
(the shortest) OCaml on my site:

59 587 2534 ray.ml
80 555 3600 ray.lisp

That is not really a fair comparison because the OCaml is vastly faster than
the Lisp. A better comparison would be with the shortest OCaml
implementation that I have managed to write:

42 478 2171 ray.ml

This is closer, of course, but I would still say that the OCaml is
significantly more concise. Specifically, the Lisp has 1.9x, 1.2x and 1.7x
more bytes, words and lines, respectively.

>> Juho's code above is much more 
>> verbose.  All the stuff at the beginning is significant extra meaning,
>> not just extra text.
> 
> Oh, sure. It was not written for direct comparisons against the
> programs Jon has on his site. It was basically a fairly mechanical
> transformation on what Nathan wrote, to see what a SSC could do on the
> code.

I was under the impression that all of the Lisp implementations were
comparable to the first implementations in the other languages only.

> If you wanted to actually compare a CL program to Jon's programs, you'd
> need to implement all the optimizations that are implemented for the
> other languages, instead of just using the basic version of the tracer.

Your CL is more optimised than the first (slowest) implementations that I am
comparing it with.

> For example, implementing the first optimization (better bounding) as
> in <http://www.cs.helsinki.fi/u/jesnellm/tmp/ray-mv-2.lisp> is a 25%
> speedup, after which the CL version runs faster than the most optimized
> Ocaml version on my x86-64. (I don't care enough to implement the other
> optimizations, this one was so easy that it was worth implementing just
> to illustrate the point).

Are you sure? The fastest Lisp is 1.9x slower than the fourth (the fastest)
OCaml here but only 20% slower than the first (the slowest) OCaml. Indeed,
none of the Lisp implementations beat even the slowest of the most
unoptimised C++, SML or OCaml programs on either of my computers (Athlon
t-bird or Athlon64).

> And yes, comparing that version to Jon's programs is still pointless
> because it is also significantly more verbose (by which I mean 50%
> more, not 200%).

How are you measuring verbosity? I get 2.4x and 2.6x longer for LOC and
bytes, respectively, comparing the fastest Lisp to the fourth (the fastest)
OCaml. As you have said, that isn't a fair comparison because the OCaml
implements a lot more and runs a lot faster.

> It might be interesting to see how this sort of manual 
> unboxing would be implemented in the other languages; in CL it's easy
> to implement with multiple-values, easy to hide the m-v ugliness behind
> macros, and reasonably easy to generate those convenience macros using
> other macros.

Stalin-compiled Scheme and MLton-compiled SML unbox automatically. In OCaml,
manual unboxing would be quite tedious. The data are already somewhat
unboxed in the sense that rays and sphere are passed around as orig/dir and
center/radius, respectively.

In practice, manual unboxing is certainly a premature optimisation in this
case.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndhlai1.j1r.jsnell@sbz-31.cs.Helsinki.FI>
<······@jdh30.plus.com> wrote:
> This is closer, of course, but I would still say that the OCaml is
> significantly more concise. Specifically, the Lisp has 1.9x, 1.2x and 1.7x
> more bytes, words and lines, respectively.

Of those, as you've applied them, I think words is the only reasonable
measure. Bytes might be at least borderline reasonable if you ignored
leading whitespace. But hey, obviously this point has already been 
beaten to death by others with little effect. If words or
non-whitespace characters don't get you the results you want, feel
free to continue measuring "verbosity" using lines.

> Are you sure? The fastest Lisp is 1.9x slower than the fourth (the fastest)
> OCaml here but only 20% slower than the first (the slowest) OCaml. Indeed,
> none of the Lisp implementations beat even the slowest of the most
> unoptimised C++, SML or OCaml programs on either of my computers (Athlon
> t-bird or Athlon64).

Of course I'm sure :-) I get completely different results:

+./time.sh:2> ocamlopt -inline 100 ray1.ml -o ray1
+./time.sh:2> ./ray1 6 512
./ray1 6 512 > ocaml-ray1.pgm  11.72s user 0.01s system 99% cpu 11.734 total
+./time.sh:3> ocamlopt -inline 100 ray2.ml -o ray2
+./time.sh:3> ./ray2 6 512
./ray2 6 512 > ocaml-ray2.pgm  8.18s user 0.00s system 99% cpu 8.203 total
+./time.sh:4> ocamlopt -inline 100 ray4.ml -o ray2
+./time.sh:4> ./ray4 6 512
./ray4 6 512 > ocaml-ray4.pgm  7.32s user 0.01s system 99% cpu 7.337 total
+./time.sh:5> /opt/packages/sbcl/0.9.4/bin/sbcl --noinform --userinit /dev/null
  --load ray-mv.lisp --eval '(time (main 6 "lisp-1.pgm" 512))' --eval '(quit)'
Evaluation took:
  9.529 seconds of real time
  9.396572 seconds of user run time
  0.118981 seconds of system run time
  0 page faults and
  96,402,704 bytes consed.
+./time.sh:7> /opt/packages/sbcl/0.9.4/bin/sbcl --noinform --userinit /dev/null
  --load ray-mv-2.lisp --eval '(time (main 6 "lisp-2.pgm" 512))' --eval '(quit)'
Evaluation took:
  6.815 seconds of real time
  6.682984 seconds of user run time
  0.124981 seconds of system run time
  0 page faults and
  98,306,896 bytes consed.
+./time.sh:9> md5sum lisp-1.pgm lisp-2.pgm ocaml-ray1.pgm ocaml-ray2.pgm 
  ocaml-ray4.pgm
63a359e5c388f2004726d83d4337f56b  lisp-1.pgm
63a359e5c388f2004726d83d4337f56b  lisp-2.pgm
63a359e5c388f2004726d83d4337f56b  ocaml-ray1.pgm
63a359e5c388f2004726d83d4337f56b  ocaml-ray2.pgm
63a359e5c388f2004726d83d4337f56b  ocaml-ray4.pgm

As you can see, the CL code I originally posted is significantly
faster than the slowest Ocaml code, and just implementing the first
optimization makes it faster than the fastest Ocaml code. I can't
explain the results you're seeing.

[ Sorry for repeating this, but I want to be clear: The code I wrote
  wasn't written within the constraints of Jon's benchmark (though it
  uses the same algorithm), and as such shouldn't be compared to the
  other versions of the code. It's somewhat unfortunate that the
  discussion somehow veered into making those comparisons.
  
  The only reason I'm posting these timings is that Jon claimed the
  code was 20% slower than the slowest Ocaml, while here it's 20%
  faster. ]

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <432306f7$0$17468$ed2e19e4@ptn-nntp-reader04.plus.net>
I wrote a response to this ages ago but knode died and lost it...

Juho Snellman wrote:
> <······@jdh30.plus.com> wrote:
>> This is closer, of course, but I would still say that the OCaml is
>> significantly more concise. Specifically, the Lisp has 1.9x, 1.2x and
>> 1.7x more bytes, words and lines, respectively.
> 
> Of those, as you've applied them, I think words is the only reasonable
> measure. Bytes might be at least borderline reasonable if you ignored
> leading whitespace. But hey, obviously this point has already been
> beaten to death by others with little effect. If words or
> non-whitespace characters don't get you the results you want, feel
> free to continue measuring "verbosity" using lines.

You can measure non-whitespace-characters-that-aren't-brackets,
tokens-that-aren't-brackets and even code-that-isn't-Lisp if you want.
Whatever you measure, the bias will be blatantly obvious if you're
setting out to show that your "ray-mv.lisp" is similar in size to my
first "ray.ml". Frankly, I'm amazed that anyone would think they were
even similar in size...

>> Are you sure? The fastest Lisp is 1.9x slower than the fourth (the
>> fastest) OCaml here but only 20% slower than the first (the slowest)
>> OCaml. Indeed, none of the Lisp implementations beat even the slowest of
>> the most unoptimised C++, SML or OCaml programs on either of my computers
>> (Athlon t-bird or Athlon64).
>
> Of course I'm sure :-) I get completely different results:

Juho got:

[ray1.ml: 11.71s; ray4.ml: 7.32s;
 ray-mv.lisp: 9.529s; ray-mv-2.lisp: 6.815s
 binary result all identical]

> As you can see, the CL code I originally posted is significantly
> faster than the slowest Ocaml code,

~20% faster.

> and just implementing the first 
> optimization makes it faster than the fastest Ocaml code. I can't
> explain the results you're seeing.

I think it is just AMD vs Intel. I only use AMD and the vast majority of my
work is on AMD64 now. Other people using Intel have reported similar
results to yours.

On my 900MHz 32-bit Athlon T-bird with level=9 and n=512, best of 3 runs:

ray1.ml: 36.982s; ray4.ml: 23.094s;
ray-mv.lisp: 33.672s

Normalising the Lisp2, OCaml4, Lisp1 and OCaml1 times relative to OCaml4, I
get:

    ?  : 1.00 : 1.46 : 1.60.

and you get:

  0.93 : 1.00 : 1.30 : 1.60.

I don't have "ray-mv-2.lisp" to test but without actually trying it, my
guess is that it'll be slightly slower than OCaml4 on my machine (OCaml
seems to be relatively faster on my machine compared to yours, indeed on
AMD compared to Intel).

On my 1.8GHz Athlon64 with level=9 and n=512:

ray1.ml: 12.524s; ray4.ml: 7.816s;
ray-mv.lisp: 14.715s

or:

    ?  : 1.00 : 1.88 : 1.60

In this case, OCaml is using its AMD64 code gen. It'll be interesting to
retime these when SBCL gets an AMD64 backend. :-)

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndi65ec.jik.jsnell@sbz-31.cs.Helsinki.FI>
<······@jdh30.plus.com> wrote:
> In this case, OCaml is using its AMD64 code gen. It'll be interesting to
> retime these when SBCL gets an AMD64 backend. :-)

That would be about 9 months ago.

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <43231c2c$0$17468$ed2e19e4@ptn-nntp-reader04.plus.net>
Juho Snellman wrote:
> <······@jdh30.plus.com> wrote:
>> In this case, OCaml is using its AMD64 code gen. It'll be interesting to
>> retime these when SBCL gets an AMD64 backend. :-)
> 
> That would be about 9 months ago.

Odd that there's no Debian package. I'll try compiling from source...

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <43256b13$0$1311$ed2619ec@ptn-nntp-reader02.plus.net>
Jon Harrop wrote:
> Juho Snellman wrote:
>> <······@jdh30.plus.com> wrote:
>>> In this case, OCaml is using its AMD64 code gen. It'll be interesting to
>>> retime these when SBCL gets an AMD64 backend. :-)
>> 
>> That would be about 9 months ago.
> 
> Odd that there's no Debian package. I'll try compiling from source...

Well, I compiled it from source but everything that I've tried so far is
faster in 32-bit mode...

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndiaun3.uhc.jsnell@sbz-30.cs.Helsinki.FI>
<······@jdh30.plus.com> wrote:
> Well, I compiled it from source but everything that I've tried so far is
> faster in 32-bit mode...

Did you forget <························@ptn-nntp-reader02.plus.net>,
where you posted results where 64-bit SBCL was 25% faster than 32-bit
SBCL in running ray-mv.lisp?

But you are quite right in that the 64-bit mode is not faster for
everything. For a lot of code the performance gain from extra
registers is less important than the performance loss from increased
cache pressure and memory bandwidth use. 

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <432607f5$0$17481$ed2e19e4@ptn-nntp-reader04.plus.net>
Juho Snellman wrote:
> <······@jdh30.plus.com> wrote:
>> Well, I compiled it from source but everything that I've tried so far is
>> faster in 32-bit mode...
> 
> Did you forget <························@ptn-nntp-reader02.plus.net>,
> where you posted results where 64-bit SBCL was 25% faster than 32-bit
> SBCL in running ray-mv.lisp?

I can't access that post so I can't check it. Are you sure I wasn't
comparing 32-bit SBCL on an x86 to 32-bit SBCL on an AMD64?

> But you are quite right in that the 64-bit mode is not faster for
> everything. For a lot of code the performance gain from extra
> registers is less important than the performance loss from increased
> cache pressure and memory bandwidth use.

I'm seeing another sinister problem now. 64-bit is slightly slower for some
programs but, seemingly randomly, is orders of magnitude slower for
others?! I left my most optimised ray tracer (that works perfectly for
level=6 and n=160 and gives much better performance than the old one)
running all night on level=9 and n=512 and it never terminated. Plenty of
spare RAM, no other active processes, exactly the same invocation. I even
tried bootstrapping SBCL.

I'll try again tonight, just to check.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Juho Snellman
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slrndid63v.tqr.jsnell@sbz-30.cs.Helsinki.FI>
<······@jdh30.plus.com> wrote:
> I can't access that post so I can't check it. Are you sure I wasn't
> comparing 32-bit SBCL on an x86 to 32-bit SBCL on an AMD64?

I'm pretty sure:

<http://groups.google.com/group/comp.lang.lisp/msg/9859b7bd1cb45e55>

> I'm seeing another sinister problem now. 64-bit is slightly slower for some
> programs but, seemingly randomly, is orders of magnitude slower for
> others?! I left my most optimised ray tracer (that works perfectly for
> level=6 and n=160 and gives much better performance than the old one)
> running all night on level=9 and n=512 and it never terminated. 

I haven't seen any other reports of problems like this. Could you
please send a mail with a description of the symptoms, the source
code, and the version of SBCL used to sbcl-devel (at lists.sf.net)?

Thanks,

-- 
Juho Snellman
"Premature profiling is the root of all evil."
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <432347ff$0$1312$ed2619ec@ptn-nntp-reader02.plus.net>
Jon Harrop wrote:
> On my 1.8GHz Athlon64 with level=9 and n=512:
> 
> ray1.ml: 12.524s; ray4.ml: 7.816s;
> ray-mv.lisp: 14.715s
> 
> or:
> 
>     ?  : 1.00 : 1.88 : 1.60
> 
> In this case, OCaml is using its AMD64 code gen. It'll be interesting to
> retime these when SBCL gets an AMD64 backend. :-)

I just installed the 64-bit SBCL and I get:

ray-mv.lisp: 11.290s

    ?  : 1.00 : 1.44 : 1.60

Out of curiosity, I've manually performed a similar amount of unboxing
as "ray-mv.lisp" on "ray1.ml". The consing drops from 15Gb to 3.5Gb and
the time taken drops to 32.500s on x86 and 7.459s on AMD64.

I'll try writing a camlp4 macro to do the unboxing for me...

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125760543.625615.23580@z14g2000cwz.googlegroups.com>
Juho Snellman wrote:
> <·············@antenova.com> wrote:
> > I don't think Jon is actually measuring the whitespace lines, just the
> > code lines.
>
> Then I don't know where Jon is getting >3x from. Unless he's comparing
> the CL program to the simplest ML program when talking of size, and
> the most optimized one when talking of speed.

He's comparing the simplest ML program, which is 57 lines long, to
yours.

> > I think although some of the short lisp/scheme programs are arguably of
> > similar verbosity to the OCaml program Juho's code above is much more
> > verbose.  All the stuff at the beginning is significant extra meaning,
> > not just extra text.
>
> Oh, sure. It was not written for direct comparisons against the
> programs Jon has on his site. It was basically a fairly mechanical
> transformation on what Nathan wrote, to see what a SSC could do on the
> code.
>
> If you wanted to actually compare a CL program to Jon's programs, you'd
> need to implement all the optimizations that are implemented for the
> other languages, instead of just using the basic version of the tracer.
> For example, implementing the first optimization (better bounding) as
> in <http://www.cs.helsinki.fi/u/jesnellm/tmp/ray-mv-2.lisp> is a 25%
> speedup, after which the CL version runs faster than the most optimized
> Ocaml version on my x86-64. (I don't care enough to implement the other
> optimizations, this one was so easy that it was worth implementing just
> to illustrate the point).

To do a little comparison you don't really need to implement them all.

> And yes, comparing that version to Jon's programs is still pointless
> because it is also significantly more verbose

I think that's Jon's point: to compare the amount of verbosity needed
to reach a certain level of performance.

> (by which I mean 50% more, not 200%).

I actually think 200% is an underestimate.  I read your version
"ray-mv.lisp" last night and it took me ages to understand it.  I still
haven't fully grokked the loop statement in "def".

If it were a real program I'd put comments all over this code and it
would get much longer.  I think the principle could be used with
shorter code though.

> It might be interesting to see how this sort of manual
> unboxing would be implemented in the other languages; in CL it's easy
> to implement with multiple-values, easy to hide the m-v ugliness behind
> macros, and reasonably easy to generate those convenience macros using
> other macros.

Yes.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4318cd11$0$22924$ed2619ec@ptn-nntp-reader01.plus.net>
Vesa Karvonen wrote:
> Don't waste your time. All this has been explained to Jon several times
> already. (Try googling for "Jon Harrop" and "verbose":
>   http://groups.google.fi/groups?hl=fi&q=jon+harrop+verbose .)

The fastest Lisp is 3.3x, ~2.2x and 3.0x as verbose as the OCaml in terms of
lines, tokens and chars, respectively.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4318cd16$0$22924$ed2619ec@ptn-nntp-reader01.plus.net>
Juho Snellman wrote:
> The cost of the temporaries isn't just allocation and GC. There's also
> the cost of moving the floats between memory and registers/x87 stack.

Do you mean OCaml may be unboxing floats into registers when SBCL is boxing
floats into RAM?

>> So my guess is that SBCL's GC needs some work.
> 
> Some of that work has been done during the summer, but not merged into
> the main branch. I'm hoping to do that for 0.9.6, to be released in
> about two months. Anyone wanting to play around with the changes, feel
> free to check out the gencgc-page-table-branch from SBCL CVS (for
> instructions see <http://sourceforge.net/cvs/?group_id=1373>). It's
> about 20% faster than vanilla SBCL for the consy CL version of the
> raytracer on x86-64.

That is a significant improvement. I'll check it out, thanks.

> It's not 3x as long by any reasonable metric. Lines of code aren't
> reasonable;

If both lines and bytes are "unreasonable", what would you suggest? Lines
without any brackets, perhaps? ;-)

> I could've mashed everything on one line and removed all 
> blank lines like in the C++ and Ocaml versions, but then it would've
> also been as unreadable as the Ocaml and C++.

Your implementation contains lots of code like this:

(defmacro def ((name params &body body)
               (mname &rest mparams)
               (wname &rest wparams))
  `(progn
    (declaim (inline ,name ,wname))
    (defun ,name ,params ,@body)
    (defmacro ,mname ,(mapcar #'car mparams)
      ,(loop with inner = (list name)
             with body = ``,',inner
             with all-names = nil
             for (form count) in (reverse mparams)
             for names = (loop repeat count collect (gensym))
             do
             (setf all-names (append all-names names))
             (setf body ``(multiple-value-bind ,',(reverse names)
                           ,,form ,,body))
             finally
             (setf (cdr inner) (reverse all-names))
             (return body)))
    (defun ,wname ,(mapcar #'car wparams)
      (,mname ,@(mapcar #'cadr wparams)))))

That simply does not appear in any other implementations. So it is not
surprising that the Lisp is much longer.

> For example it's got about 15% more whitespace-separated tokens than the
> Ocaml code. 

Only if you neglect the enormous number of unseparated brackets in the Lisp.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Andras Simon
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <vcd7jdy7cir.fsf@csusza.math.bme.hu>
Jon Harrop <······@jdh30.plus.com> writes:

> Juho Snellman wrote:

[...]

> 
> > For example it's got about 15% more whitespace-separated tokens than the
> > Ocaml code. 
> 
> Only if you neglect the enormous number of unseparated brackets in the Lisp.

Which, of course, you must. Or would you count spaces if you were
measuring a Python program?

Andras


From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a2ec3$0$22920$ed2619ec@ptn-nntp-reader01.plus.net>
Andras Simon wrote:
>> Only if you neglect the enormous number of unseparated brackets in the
>> Lisp.
> 
> Which, of course, you must.

No.

> Or would you count spaces if you were measuring a Python program?

We're not measuring a Python program.

You'll have to ask Vesa which tokens he wishes to discount in order to get a
Lisp program with 3x as many lines and bytes to appear to be equally
verbose.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: ············@mac.com
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125734044.865131.262270@g47g2000cwa.googlegroups.com>
Jon Harrop wrote:

> Interestingly, Sun's Java has the same problem as SBCL here - it is very
> slow at allocating and immediately collecting millions of temporary
> vectors, effectively using malloc and free. The fix for the Java code
> (manually inlining all vector calculations) is even more hideous than the
> fix for the CL code.
>

Actually, that's completely wrong. Sun's Java has never used malloc and
free to allocate objects. The source is publicly available; look for
yourself.

Sun's young gen gc (what you'd see doing collections in the scenario
you describe) is actually quite speedy.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a2ec4$0$22920$ed2619ec@ptn-nntp-reader01.plus.net>
············@mac.com wrote:
> Sun's young gen gc (what you'd see doing collections in the scenario
> you describe) is actually quite speedy.

If Java's GC is "quite speedy", how do you explain Java being up to 5x
slower than other GC'd languages and it regaining most of that lost
performance when vector calculations are manually inlined?

It seems to me that Java's GC is much slower for this task than GCs found in
most FPLs, most likely because the rapid allocation and collection of 3D
vectors in this case is equivalent to typical use in functional programming
(e.g. symbolics) but is untypical in Java.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: ············@mac.com
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125792809.362699.208460@g47g2000cwa.googlegroups.com>
> If Java's GC is "quite speedy", how do you explain Java being up to 5x
> slower than other GC'd languages and it regaining most of that lost
> performance when vector calculations are manually inlined?
>

Perhaps your Java implementation is generating 5x more garbage. Perhaps
you've written your Java version so the temporary objects cannot be
collected in the young gen. Perhaps your working set size vs the young
gen size is such that objects are being tenured when they should not
be. I don't know the answer.

Your earlier guess that the GC was using free and malloc tells me you
could gain a LOT of performance by learning some basics and tuning
after that.

You may even find that GC wasn't the issue at all. What percentage of
time is your program spending in GC vs actually running?
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a4eae$0$97126$ed2619ec@ptn-nntp-reader03.plus.net>
············@mac.com wrote:
>> If Java's GC is "quite speedy", how do you explain Java being up to 5x
>> slower than other GC'd languages and it regaining most of that lost
>> performance when vector calculations are manually inlined?
> 
> Perhaps your Java implementation is generating 5x more garbage.

If my interpretation of the verbose GC output is correct then the Java is
generating the same amount of garbage as the OCaml (10Gb).

> Perhaps 
> you've written your Java version so the temporary objects cannot be
> collected in the young gen.

I think it is the same as my other implementations in this respect.

> Perhaps your working set size vs the young 
> gen size is such that objects are being tenured when they should not
> be.

Unless Sun chose pathologically bad magic numbers and no adaptation for their
GC, you would expect the performance to vary wildly with the different
program variations and problem sizes. In fact, the relative performance
shows very little variation. That indicates to me that I haven't just
picked a bad set of numbers for Java but, rather, that Java is just slow.

> I don't know the answer. 

Neither do I (this is the only Java program I have ever written) but my Java
programs were peer reviewed on comp.lang.java.programmer and the only fix
proposed that I found to work was to manually inline all vector
calculations.

> Your earlier guess that the GC was using free and malloc

I said "effectively" to mean: "Java's GC is so slow that it might as well be
using malloc and free".

> tells me you 
> could gain a LOT of performance by learning some basics and tuning
> after that.

Of course. That is equally true for all of the other languages and
implementations. This exercise was never intended to be an exercise in
tuning - I wanted to find out how efficient and verbose the languages are
when fed simple, obvious implementations.

Someone out there may be interested in optimising my Java implementation but
I am far more interested in learning about the faster and more succinct
languages.

> You may even find that GC wasn't the issue at all. What percentage of
> time is your program spending in GC vs actually running?

According to the verbose GC output obtained with:

  java -Xloggc:log ray 6 160 >image.pgm

Java spends 7.4x longer in its GC than OCaml does.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: ············@mac.com
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125800141.363932.175640@f14g2000cwb.googlegroups.com>
Jon Harrop wrote:

> Of course. That is equally true for all of the other languages and
> implementations. This exercise was never intended to be an exercise in
> tuning - I wanted to find out how efficient and verbose the languages are
> when fed simple, obvious implementations.
>
> Someone out there may be interested in optimising my Java implementation but
> I am far more interested in learning about the faster and more succinct
> languages.

Ah. You should have said you weren't interested in benchmarking,
but in justifying your particular choice of language.

>
> According to the verbose GC output obtained with:
>
>   java -Xloggc:log ray 6 160 >image.pgm
>
> Java spends 7.4x longer in its GC than OCaml does.
>

Okay, so on your test it does. The question is *why*.

Is your OCaml heap the same size as the Java heap? Do they use the
same collection strategies? If they both have an "eden" in which new
objects are allocated, are they the same size?

I'd want to understand what was happening before I wrote off an
entire language as "slower and more verbose".

You could even take GC out of the equation. Set both heaps
to 20GB or so, and neither should need to GC at all.
From: ············@mac.com
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125803653.998133.184280@g44g2000cwa.googlegroups.com>
Jon Harrop wrote:

> If my interpretation of the verbose GC output is correct then the Java is
> generating the same amount of garbage as the OCaml (10Gb).

Well, by my measurements, you're creating less than 700MB of total
data.

> (...) That indicates to me that I haven't just
> picked a bad set of numbers for Java but, rather, that Java is just slow.

 (...)

> Java spends 7.4x longer in its GC than OCaml does.
>

I downloaded and ran your test.

Here's what I see (shortened to prevent line breaks):

java -XX:+TraceGen0Time -XX:+TraceGen1Time ray 6 160 > image.pgm
[(...) gen 0 time 0.1336180 secs, 1283 GC's, avg GC time 0.0001041]
[(...) gen 1 time 0.0000000 secs, 0 GC's, avg GC time nan]

Hmm, almost 1300 GC's! I bet the young gen is too small.

java -Xmx900M -XX:NewSize=830M -XX:+TraceGen0Time
        -XX:+TraceGen1Time ray 6 160 > image.pgm
[(...) gen 0 time 0.0000000 secs, 0 GC's, avg GC time nan]

Okay, there's zero time in GC. And *way* less than 10GB of garbage.

This entire test took only a couple of seconds to run, and that's
barely enough to qualify it as a toy benchmark.

You *really* need to understand what you're measuring and why.
I don't think you've put effort into this beyond "Is OCaml faster?
Good, end of benchmark!"
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a6bed$0$1311$ed2619ec@ptn-nntp-reader02.plus.net>
············@mac.com wrote:
> Jon Harrop wrote:
>> If my interpretation of the verbose GC output is correct then the Java is
>> generating the same amount of garbage as the OCaml (10Gb).
> 
> Well, by my measurements, you're creating less than 700MB of total
> data.

Presumably I was talking about arguments (9, 512) and you are talking about
(6, 160), so that isn't surprising. I get roughly 1Gb for 6 and 160,
depending on the architecture and language.

> I downloaded and ran your test.
> 
> Here's what I see (shortened to prevent line breaks):
> 
> java -XX:+TraceGen0Time -XX:+TraceGen1Time ray 6 160 > image.pgm
> [(...) gen 0 time 0.1336180 secs, 1283 GC's, avg GC time 0.0001041]
> [(...) gen 1 time 0.0000000 secs, 0 GC's, avg GC time nan]
> 
> Hmm, almost 1300 GC's! I bet the young gen is too small.

For 6 and 160, OCaml does 5524 minor collections and 9 major collections.
For 9 and 512, OCaml does 64229 minor and 8 major collections.

The latter test is more interesting because it takes ~30s to run. Hence I
used it for the language comparison on my site.

For 9 and 512, Java tells me 1145 gen 0 and 4 gen 1 collections taking a
total of 0.96s whereas OCaml's GC does many more collections but takes a
total of only 0.16s.

> java -Xmx900M -XX:NewSize=830M -XX:+TraceGen0Time
>         -XX:+TraceGen1Time ray 6 160 > image.pgm
> [(...) gen 0 time 0.0000000 secs, 0 GC's, avg GC time nan]
> 
> Okay, there's zero time in GC.

As I said, this test is so short that I have little faith in Java's
profiling. It is also worth noting that this triples the running time on my
machine despite it not going to swap.

> And *way* less than 10GB of garbage. 

You used the wrong command-line arguments.

> This entire test took only a couple of seconds to run, and that's
> barely enough to qualify it as a toy benchmark.

Use 9 and 512.

> You *really* need to understand what you're measuring and why.

I'm measuring the performance of a simple program to see how much
effort it takes to write and how efficiently the compiler compiles it.

> I don't think you've put effort into this...

If you want to put more effort in then please do. Why do you think Java is
so slow and what can you do to speed it up?

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Ray Dillinger
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <FEFSe.11988$p%3.48185@typhoon.sonic.net>
············@mac.com wrote:

> Your earlier guess that the GC was using free and malloc tells me you
> could gain a LOT of performance by learning some basics and tuning
> after that.

I'm still a bit of a novice, because I've only ever written one
garbage collector, and that for a system that has other reasons
why its performance is expected to be slow...  but my collector
uses malloc and free at the level of individual objects, because:

1)  I had read that most of the sloth problem of malloc and
     free had been ironed out in later versions but the reputation
     for sloth lived on - which I'm more than willing to believe
     considering the history of Lisp itself.

2)  All the easy ways to do it without malloc and free have some
     annoying behaviors; either the system's arena can't grow with
     increasing program heapsize, or it can't shrink with decreasing
     program heapsize, or the GC can't be managed with ultra-low
     latency because it might require a total heap copy.  Since my
     goal was to strictly limit the number of microseconds spent on
     each GC call, copying the entire heap to a larger/smaller arena
     was not an option.

3)  I wanted stable pointers to objects:  That is, C code
     referencing the objects by pointer need not worry that
     the object has been moved by the GC. Yes, I know that
     fragmentation is a performance problem, but for now it's
     something I'm living with.

Now I realize that there are hard ways to do what I want; I
could have the GC run multiple memory arenas and use more
complex ideas of object identity and more preprocessing of
pointers.  I just used the bottom 3 bits for a typetag since
valid pointers from malloc always come aligned on 8-byte
boundaries on my system; if the bottom 3 bits are 0, therefore,
it's a pointer.  Otherwise it's a typetag for an unboxed 29-bit
value.
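
In other words (a toy model of the scheme, with addresses treated as
plain integers):

;; malloc returns 8-byte-aligned addresses, so a word whose low three
;; bits are zero is a pointer; anything else carries a typetag plus
;; an unboxed 29-bit payload.
(defconstant +tag-width+ 3)

(defun pointer-word-p (word)
  (zerop (ldb (byte +tag-width+ 0) word)))

(defun word-tag (word)
  (ldb (byte +tag-width+ 0) word))

(defun word-payload (word)
  ;; shift the tag away to recover the 29-bit immediate value
  (ash word (- +tag-width+)))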

But my ultra-simple garbage collector (~300 lines of C) does
basically what I want.  Stable pointers for the life of the
object, growth and shrinkage of heap usage with actual program
requirements so my programs can play nice with others, no
arbitrary program heap size limits short of the 2GB
addressable by the machine, and ultra-low latency.

Do I really still lose substantial performance to the use of
the OS's malloc and free?  Or is that just a lingering myth?
Are the higher-performance methods consistent with my other
goals?

			Bear
From: ············@mac.com
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125865564.775338.237880@o13g2000cwo.googlegroups.com>
Ray Dillinger wrote:

>
> Do I really still lose substantial performance to the use of
> the OS's malloc and free?  Or is that just a lingering myth?
> Are the higher-performance methods consistent with my other
> goals?
>

Given the goals you've stated:

1) Objects do not move during their lifetime
2) GC latency is more important than throughput
3) Code simplicity is very important

malloc and free aren't a terrible choice. You can certainly do better,
but even the first step is a big one. A simple non-moving collector
will be at least 10x the code you have now. I'd run with what you've
got until it becomes a bottleneck.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431b3113$0$1289$ed2619ec@ptn-nntp-reader02.plus.net>
Ray Dillinger wrote:
> Do I really still lose substantial performance to the use of
> the OS's malloc and free?  Or is that just a lingering myth?
> Are the higher-performance methods consistent with my other
> goals?

It depends on the usage of your GC. If it will be used to automate "delete"
in C++ then using malloc and free is fine. But if it will be used to
allocate and deallocate all values (e.g. for an FPL) then it is likely to be
very slow.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Ray Dillinger
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <ROGSe.12006$p%3.48163@typhoon.sonic.net>
Jon Harrop wrote:
> Ray Dillinger wrote:
> 
>>Do I really still lose substantial performance to the use of
>>the OS's malloc and free?  Or is that just a lingering myth?
>>Are the higher-performance methods consistent with my other
>>goals?
> 
> 
> It depends on the usage of your GC. If it will be used to automate "delete"
> in C++ then using malloc and free is fine. But if it will be used to
> allocate and deallocate all values (e.g. for an FPL) then it is likely to be
> very slow.

That's exactly what I'm using it for.  It's a toy-lisp
implementation, and I use the heap a *lot*.  Need a floating
point number? It's larger than 32 bits, so I allocate one
and store a pointer to it.  Invoke a procedure?  Allocate a
call frame on the heap.  Returning a bignum?  Allocate it
and return a pointer to it.  Each of these is a call to
malloc (and eventually, at the end of the object's lifetime,
free).

Self tail calls I've optimized into gotos, so they don't
continually allocate and collect call frames; but I haven't
gotten to mutual tail recursions etc yet; they still produce
and eventually collect frames on every call.

So what ought I be doing to avoid this speed penalty for
native OS calls, and why aren't the native OS calls doing
it?

			Bear
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431b946f$0$1321$ed2619ec@ptn-nntp-reader02.plus.net>
Ray Dillinger wrote:
> Self tail calls I've optimized into gotos, so they don't
> continually allocate and collect call frames; but I haven't
> gotten to mutual tail recursions etc yet; they still produce
> and eventually collect frames on every call.

Allocating and "collecting" (deallocating) frames is usually significantly
simpler and, consequently, faster than heap allocation and GC collection.
This is why non-tail recursive functions are often faster than tail
recursive functions for small depths of recursion.

> So what ought I be doing to avoid this speed penalty for
> native OS calls, and why aren't the native OS calls doing
> it?

The native OS calls aren't "doing it" for two main reasons:

1. They are not designed for this use (this distribution of object
lifetimes).

2. They cannot be optimised to the same extent as, for example, the
allocators emitted by ocamlopt alongside the generated native code.

The simplest way to address this issue is to switch to C++ and use STL
containers instead of malloc and free. This is actually likely to make your
code shorter rather than longer and the STL's built-in allocators are
likely to outperform malloc and free because they are better optimised for
this kind of use.

The next step would be to write your own allocation and deallocation
routines on top of malloc and free (or new and delete). I've never done
this but I'd recommend reading up on GC papers and asking GC experts which
books they recommend. You can pretty much make this task as difficult as
you want. I think it would actually be quite fun (it's on my "to do" list)
because you can measure the distributions of lifetimes and sizes of values
and play around with multiple allocators or adaptive allocation and so on.
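For example, the measuring step could start out as small as a wrapper
that histograms allocation sizes and lifetimes (a hypothetical sketch,
not code I've tested; the names and the clock() granularity are mine):

#include <stdlib.h>
#include <time.h>

typedef struct { size_t size; clock_t born; } hdr;

static size_t size_hist[32];   /* log2 buckets of request sizes     */
static size_t life_hist[32];   /* log2 buckets of lifetimes (ticks) */

static int bucket(unsigned long n)
{
    int b = 0;
    while (n >>= 1)
        b++;
    return b < 31 ? b : 31;
}

void *xalloc(size_t size)
{
    hdr *h = malloc(sizeof(hdr) + size);
    if (!h)
        abort();
    h->size = size;            /* kept so a report could cross-tabulate
                                  size against lifetime */
    h->born = clock();
    size_hist[bucket(size)]++;
    return h + 1;
}

void xfree(void *p)
{
    hdr *h = (hdr *)p - 1;
    life_hist[bucket((unsigned long)(clock() - h->born))]++;
    free(h);
}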

Finally, when your compiler is emitting highly optimised native code you'll
want to hand-optimise your allocator and carefully optimise your garbage
collector. That's probably a few years away... ;-)

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431c7208$0$17482$ed2e19e4@ptn-nntp-reader04.plus.net>
As a little test, I spent last night toying with the C++ implementation of
my ray tracer. This is a good program to use because it has a similar
distribution of 3D vector lifetimes to values in typical FP programs rather
than imperative programs. However, my results weren't all as I'd
expected...

Using new and delete with reference counting, the program slows down by
about an order of magnitude. Some of that slow-down can be attributed to
the overhead of reference counting. So I was right that new and delete are
substantially slower than real GCs.

Jon Harrop wrote:
> The simplest way to address this issue is to switch to C++ and use STL
> containers instead of malloc and free. This is actually likely to make
> your code shorter rather than longer and the STL's built-in allocators are
> likely to outperform malloc and free because they are better optimised for
> this kind of use.

However, this suggestion of mine was grossly off the money. It seems that
improving upon new and delete isn't as easy as I had assumed, particularly
using the STL.

I was quite surprised that, when I replaced my calls to new and delete
with insertions into and deletions from an STL list, my program slowed
down by an order of magnitude. Worse, the STL never reclaimed the memory,
so it ran out of memory on my usual test.

> The next step would be to write your own allocation and deallocation
> routines on top of malloc and free (or new and delete). I've never done
> this but I'd recommend reading up on GC papers and asking GC experts which
> books they recommend. You can pretty much make this task as difficult as
> you want. I think it would actually be quite fun (it's on my "to do" list)
> because you can measure the distributions of lifetimes and sizes of values
> and play around with multiple allocators or adaptive allocation and so on.

I've had a quick play with writing a custom allocator and deallocator using
the STL and all of my results are much slower than using new and delete.

I tried having young and old generations, measuring the distributions of
object lifetimes in terms of both real-time and references made. I tried
linearly searched STL vectors of objects, keeping a balanced binary tree
(STL set) of free elements, and keeping a stack of free elements. None of
these approaches comes close to the performance of using new and delete.

One question remains though - are new and delete "as good as" malloc and
free? I've yet to try malloc and free but I expect they are exactly the
same...

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <psrnl5hc.fsf@alum.mit.edu>
Jon Harrop <······@jdh30.plus.com> writes:

> As a little test, I spent last night toying with the C++ implementation of
> my ray tracer. This is a good program to use because it has a similar
> distribution of 3D vector lifetimes to values in typical FP programs rather
> than imperative programs. However, my results weren't all as I'd
> expected...
>
> Using new and delete with reference counting, the program slows down by
> about an order of magnitude. Some of that slow-down can be attributed to
> the overhead of reference counting. So I was right that new and delete are
> substantially slower than real GCs.

I'm not surprised.  Reference counting (as it is usually done) is a
terrible way to GC.  You have to do all sorts of tricks to get it to
perform well.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431ca222$0$17482$ed2e19e4@ptn-nntp-reader04.plus.net>
Joe Marshall wrote:
> Jon Harrop <······@jdh30.plus.com> writes:
>> Using new and delete with reference counting, the program slows down by
>> about an order of magnitude. Some of that slow-down can be attributed to
>> the overhead of reference counting. So I was right that new and delete
>> are substantially slower than real GCs.
>
> I'm not surprised.  Reference counting (as it is usually done) is a
> terrible way to GC.  You have to do all sorts of tricks to get it to
> perform well.

Yes. I don't want to measure the overhead of reference counting though. I'm
trying to estimate the overhead of using malloc and free for all GC
allocations and deallocations.

The ray tracer takes 5.6s to run normally. Allocating and collecting all 3D
vectors using malloc and free with reference counting, it takes 14s. If I
allocate and deallocate twice for each vector, it takes 22s. Assuming all
other things to be equal (which they're not quite) then malloc and free are
taking 22-14=8s of the time. So 8/14, about 57%, of the time is spent in
malloc and free.

The next question is how much we can reduce that 57%. Given that several GCd
languages match the performance of C++, they must have much more efficient
collection schemes than my simple reference counter but they must also have
much more efficient allocators and deallocators than malloc and free. Does
that sound reasonable?

So Ray might be able to double the speed of his language just by optimising
the allocator and deallocator. That's optimistic, but I think there are
significant gains to be made here without a huge amount of work.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <ll2bkren.fsf@alum.mit.edu>
Jon Harrop <······@jdh30.plus.com> writes:

> Given that several GCd languages match the performance of C++, they
> must have much more efficient collection schemes than my simple
> reference counter but they must also have much more efficient
> allocators and deallocators than malloc and free. Does that sound
> reasonable?

It's plausible.  A simple first-fit malloc that interlocked on all
allocations (it has to be thread-safe) would perform worse than an
allocator that only has to bump a thread-local frontier pointer.
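The fast path really is tiny.  A sketch (the collector is a
placeholder, and in a threaded runtime frontier and limit would be
thread-local rather than static):

#include <stddef.h>

#define NURSERY_WORDS (256 * 1024)

extern void minor_gc(void);   /* placeholder: evacuate live objects and
                                 reset frontier to the nursery base */

static size_t nursery[NURSERY_WORDS];
static size_t *frontier = nursery;
static size_t *limit = nursery + NURSERY_WORDS;

void *gc_alloc(size_t words)
{
    if (frontier + words > limit)   /* slow path, taken rarely; assumed
                                       to leave room afterwards */
        minor_gc();
    void *p = frontier;
    frontier += words;              /* the whole fast path: compare, bump */
    return p;
}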

Malloc and free, of course, take time proportional to the total number
of structures allocated.  Depending on the GC algorithm, a flip may
take time proportional to the amount of live storage, proportional to
the size of the smallest generation, or proportional to the amount
collected.

In my experience, a properly tuned generational copying GC typically
has very little overhead.

~jrm
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126043259.936961.29340@f14g2000cwb.googlegroups.com>
Ray Dillinger wrote:
> Jon Harrop wrote:
> > Ray Dillinger wrote:
> >
> >>Do I really still lose substantial performance to the use of
> >>the OS's malloc and free?  Or is that just a lingering myth?
> >>Are the higher-performance methods consistent with my other
> >>goals?
> >
> >
> > It depends on the usage of your GC. If it will be used to automate "delete"
> > in C++ then using malloc and free is fine. But if it will be used to
> > allocate and deallocate all values (e.g. for a FPL) then it is likely to be
> > very slow.
>
> That's exactly what I'm using it for.  It's a toy-lisp
> implementation, and I use the heap a *lot*.  Need a floating
> point number? It's larger than 32 bits, so I allocate one
> and store a pointer to it.  Invoke a procedure?  Allocate a
> call frame on the heap.  Returning a bignum?  Allocate it
> and return a pointer to it.  Each of these is a call to
> malloc (and eventually, at the end of the object's lifetime,
> free).
>
> Self tail calls I've optimized into gotos, so they don't
> continually allocate and collect call frames; but I haven't
> gotten to mutual tail recursions etc yet; they still produce
> and eventually collect frames on every call.
>
> So what ought I be doing to avoid this speed penalty for
> native OS calls, and why aren't the native OS calls doing
> it?

In addition to what Jon said, it's useful to understand why GC built
on malloc is not fast for small objects.  Most mallocs use a set of
bins: for objects within some size range the allocator will pick an
appropriate bin.  Each object in a bin is the same size, and that
size is the amount you get back (often no matter what you asked for).
There may be a free list for each bin, or some other more complex
structure.  The allocator may create new bins for sizes your program
often uses.

All this stuff is quite complicated; it's made to deal with programs
written without GC, which do not allocate many very small objects.  C
programs often do many small allocations, but few as small as, say,
6 bytes.  As a result malloc may not be very fast at this.  Also, the
smallest bin or second-smallest bin may be rather big.
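To make that concrete, the rounding works something like this
(invented bin sizes; real mallocs differ) -- note that a 6-byte
cons-sized request comes back as a 32-byte chunk:

#include <stddef.h>

static const size_t bin_sizes[] = { 32, 64, 128, 256, 512 };

size_t round_to_bin(size_t request)
{
    size_t i;
    for (i = 0; i < sizeof bin_sizes / sizeof bin_sizes[0]; i++)
        if (request <= bin_sizes[i])
            return bin_sizes[i];    /* the amount you actually get back */
    return request;                 /* large requests take a separate path */
}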

It's also worth mentioning that some mallocs are still really horrible.

(BTW Malloc is a library call, not normally a system call)
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <br38sbl1.fsf@alum.mit.edu>
Ray Dillinger <····@sonic.net> writes:

> I'm still a bit of a novice, because I've only ever written one
> garbage collector, and that for a system that has other reasons
> why its performance is expected to be slow...  but my collector
> uses malloc and free at the level of individual objects...

> Do I really still lose substantial performance to the use of
> the OS's malloc and free?  Or is that just a lingering myth?
> Are the higher-performance methods consistent with my other
> goals?

It's hard to say without knowing more about the implementation of
malloc and free.  There are some methods consistent with your other
goals, but they may or may not improve performance.  As an example,
you might want to manage pools of cons cells yourself rather than use
malloc and free.

You might try substituting the Boehm GC for your GC and seeing how
they compare.
From: Ray Dillinger
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <PLTSe.12121$p%3.48576@typhoon.sonic.net>
Joe Marshall wrote:
> Ray Dillinger <····@sonic.net> writes:

>>Do I really still lose substantial performance to the use of
>>the OS's malloc and free?  Or is that just a lingering myth?
>>Are the higher-performance methods consistent with my other
>>goals?

> It's hard to say without knowing more about the implementation of
> malloc and free.  There are some methods consistent with your other
> goals, but they may or may not improve performance.  As an example,
> you might want to manage pools of cons cells yourself rather than use
> malloc and free.

Yeah, I've looked at that, and will probably do it eventually;
my little collector has a space overhead of 5 words per object
which is really nasty when the objects themselves (like cons
cells) are only 2 words.  That and a float pool are probably
the easiest performance gains I can create.
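Something like this is what I have in mind (a sketch, not code I've
written yet): cells carved from slabs malloc'ed in bulk, with the
free list threaded through the car slots, so there's no per-object
header at all:

#include <stdlib.h>

typedef struct cons { void *car, *cdr; } cons;   /* exactly 2 words */

#define SLAB_CELLS 4096
static cons *free_cells;

static void grow_pool(void)
{
    cons *slab = malloc(SLAB_CELLS * sizeof(cons)); /* one malloc header
                                                       per 4096 cells */
    int i;
    if (!slab)
        abort();
    for (i = 0; i < SLAB_CELLS; i++) {   /* thread the free list
                                            through the car slots */
        slab[i].car = free_cells;
        free_cells = &slab[i];
    }
}

cons *cons_alloc(void *car, void *cdr)
{
    cons *c;
    if (!free_cells)
        grow_pool();
    c = free_cells;
    free_cells = c->car;
    c->car = car;
    c->cdr = cdr;
    return c;
}

void cons_free(cons *c)          /* called by the collector */
{
    c->car = free_cells;
    free_cells = c;
}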

> You might try substituting the Boehm GC for your GC and seeing how
> they compare.

I don't think it would be an easy drop-in switch; IIRC, the
Boehm collector doesn't handle weak pointers, so my symbol
table would prevent out-of-scope symbols from ever getting
reaped and, in a program that does symbol creation at runtime,
that means a memory leak.  Or is that information out of
date?

				Bear
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <u0gzl5py.fsf@alum.mit.edu>
Ray Dillinger <····@sonic.net> writes:

> I don't think it would be an easy drop-in switch; IIRC, the
> Boehm collector doesn't handle weak pointers, so my symbol
> table would prevent out-of-scope symbols from ever getting
> reaped and, in a program that does symbol creation at runtime,
> that means a memory leak.  Or is that information out of
> date?

There isn't an API call for weak pointers, but they are easily
implemented using `atomic' objects and weak links.
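For example, something like this (a sketch from memory -- check gc.h
for the exact names and the locking rules before trusting it; a
threaded program would need the allocation lock around the read):

#include <gc.h>

typedef struct { void *hidden; } weak_ref;

weak_ref *make_weak(void *obj)
{
    /* atomic: the GC won't scan this cell, so it doesn't keep obj alive */
    weak_ref *w = GC_malloc_atomic(sizeof *w);
    w->hidden = (void *)GC_HIDE_POINTER(obj);
    /* ask the collector to zero the cell when obj becomes unreachable */
    GC_general_register_disappearing_link(&w->hidden, obj);
    return w;
}

void *weak_deref(weak_ref *w)
{
    return w->hidden ? (void *)GC_REVEAL_POINTER(w->hidden) : NULL;
}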
From: Ulrich Hobelmann
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <3o3g9oF434h9U1@individual.net>
Ray Dillinger wrote:
> 3)  I wanted stable pointers to objects:  That is, C code
>     referencing the objects by pointer need not worry that
>     the object has been moved by the GC. Yes, I know that
>     fragmentation is a performance problem, but for now it's
>     something I'm living with.

The latter one is a myth, too: 
http://citeseer.ist.psu.edu/johnstone97memory.html

Well, if you only have tiny and big objects, and always allocate small 
objects where the big ones were freed, then you could end up with lots 
of fragmentation swiss cheese.  With a reasonable memory manager (maybe 
two different arenas for smallish objects and big chunks) there is no 
problem.

> Do I really still lose substantial performance to the use of
> the OS's malloc and free?  Or is that just a lingering myth?

Probably a hand-written malloc is faster, but not much.  I think most 
GCs' malloc() is just a check and pointer-increment, while a typical 
malloc is a non-tiny function.

Of course your implementation might be fast enough, which is fast enough.

-- 
My ideal for the future is to develop a filesystem remote interface
(a la Plan 9) and then have it implemented across the Internet as
the standard rather than HTML.  That would be ultimate cool.
	Ken Thompson
From: MSCHAEF.COM
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1fKdneY3jKRuQ4DeRVn-qA@io.com>
In article <·····················@typhoon.sonic.net>,
Ray Dillinger  <····@sonic.net> wrote:
>············@mac.com wrote:
   ...
>I'm still a bit of a novice, because I've only ever written one
>garbage collector, 

Well, I've written zero, so I think that makes me even more of a novice, 
so take what I say with the appropriately sized grain of salt. :-)

>Do I really still lose substantial performance to the use of
>the OS's malloc and free?  Or is that just a lingering myth?

I've spent a fair amount of time playing around with and extending
SIOD (including taking some measurements with a profiler). SIOD's GC,
when compiled with Visual C++ 2003 (no optimizations) suffers pretty
badly when it has to free lots of malloc'ed storage. SIOD's sstring 
representation heap allocates buffers for strings, and whenever I have 
lots of free strings on the heap, the GC free phase is several orders of 
magnitude longer than otherwise.

-Mike
-- 
http://www.mschaef.com
From: Ray Dillinger
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <iZsTe.12419$p%3.50154@typhoon.sonic.net>
MSCHAEF.COM wrote:
> 
> I've spent a fair amount of time playing around with and extending
> SIOD (including taking some measurements with a profiler). SIOD's GC,
> when compiled with Visual C++ 2003 (no optimizations) suffers pretty
> badly when it has to free lots of malloc'ed storage. SIOD's sstring 
> representation heap allocates buffers for strings, and whenever I have 
> lots of free strings on the heap, the GC free phase is several orders of 
> magnitude longer than otherwise.

SIOD uses a naive tracing collector that stops the program
while GC is done (correct me if this is no longer true) so
the amount of time required by the GC phase will be at least
proportional to the number of live items on the heap.  This
gets impractical for programs with large datasets.

But that aside, "orders of magnitude" (100x or more) sounds
wrong.

Does it behave as though a call to free() requires time
proportional to or polynomial in the number of items on
the heap?  Because if it does, then my limited-latency
calculations, which depended on malloc and free taking
time logarithmic to the number of items on the heap, are
out the window and I need to redesign.

				Bear
From: Ulrich Hobelmann
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <3o7mavF4jghfU1@individual.net>
Ray Dillinger wrote:
> Does it behave as though a call to free() requires time
> proportional to or polynomial in the number of items on
> the heap?  Because if it does, then my limited-latency
> calculations, which depended on malloc and free taking
> time logarithmic to the number of items on the heap, are
> out the window and I need to redesign.

I don't think it takes that much.  A typical free() only looks up the 
data of the chunk you're freeing, probably coalesces it with the two 
neighboring chunks (if they are freed holes) into one bigger chunk, puts 
it on the right free-list or whatever and then returns.  This should be 
close to constant time.
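With boundary-tag headers the whole sequence looks something like
this (a hypothetical layout; it assumes chunk sizes are multiples of
8 and that sentinel chunks guard both ends of the heap):

#include <stddef.h>

typedef struct chunk {
    size_t size;            /* payload bytes; low bit set = chunk is free */
    struct chunk *prev;     /* neighbouring chunk just below in memory */
} chunk;                    /* payload follows the header */

#define FREE_BIT   ((size_t)1)
#define SIZE(c)    ((c)->size & ~FREE_BIT)
#define IS_FREE(c) ((c)->size & FREE_BIT)
#define NEXT(c)    ((chunk *)((char *)(c) + sizeof(chunk) + SIZE(c)))

void chunk_free(chunk *c)
{
    chunk *n, *p;
    c->size |= FREE_BIT;
    n = NEXT(c);
    if (IS_FREE(n)) {       /* merge the hole above us into c */
        c->size = (SIZE(c) + sizeof(chunk) + SIZE(n)) | FREE_BIT;
        NEXT(c)->prev = c;
    }
    p = c->prev;
    if (p && IS_FREE(p)) {  /* merge c into the hole below us */
        p->size = (SIZE(p) + sizeof(chunk) + SIZE(c)) | FREE_BIT;
        NEXT(p)->prev = p;
    }
    /* a real allocator also unlinks n and p from their free lists while
       merging, and pushes the result on the right list: O(1) too if the
       lists are doubly linked */
}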

-- 
My ideal for the future is to develop a filesystem remote interface
(a la Plan 9) and then have it implemented across the Internet as
the standard rather than HTML.  That would be ultimate cool.
	Ken Thompson
From: MSCHAEF.COM
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <t6-dnUB3F9U4D7zeRVn-gw@io.com>
In article <·····················@typhoon.sonic.net>,
Ray Dillinger  <····@sonic.net> wrote:
>MSCHAEF.COM wrote:
>> 
>> I've spent a fair amount of time playing around with and extending
>> SIOD (including taking some measurements with a profiler). SIOD's GC,
>> when compiled with Visual C++ 2003 (no optimizations) suffers pretty
>> badly when it has to free lots of malloc'ed storage. 
   ...
>SIOD uses a naive tracing collector that stops the program
>while GC is done (correct me if this is no longer true)
   ...
>But that aside, "orders of magnitude" (100x or more) sounds
>wrong.
    ...
>Does it behave as though a call to free() requires time
>proportional to or polynomial in the number of items on
>the heap? 

I've spent a little time trying to reproduce what I saw earlier. While 
I've seen the behavior during those attempts, I can't reproduce it 
consistently enough to justify posting anything more declarative about 
it. There's probably some behind-the-scenes thing the MSVC runtime is 
doing that I don't know about.

That said, the common case I'm seeing now looks like you probably 
expected (MSVC 2003, Release Mode):

  (gc)(%stress-lisp-heap 500000)(gc)

  [GC @ T+118.617358: 0.052905s, 3060132 cells]
  [GC @ T+118.686952: 0.051011s, 3060132 cells]

  (gc)(%stress-c-heap 500000 3)(gc)

  [GC @ T+127.195295: 0.052422s, 3060132 cells]
  [GC @ T+127.406807: 0.181771s, 3060132 cells]

The %stress-*-heap functions are functions (written in C to avoid consing 
for environments, etc.) that allocate unreferenced stuff on the heaps. 
Objects on the 'lisp heap' are allocated by pulling the front element off 
of a free list. Objects on the 'C heap' are vectors: they're composed of a 
header off of the 'lisp heap' and a (sizeof(void *)*3) byte allocation off 
of the malloc/free heap. In this case, the time taken by the GC (the 
second number in the [GC ...] messages) triples with the additional burden 
of calling free() on heap objects.

As far as how it scales with the number of malloc'ed objects:

   (dotimes n 10
      (format #t "n=~a, " (* 200000 n))
      (dynamic-let ((*info* #f)) (gc))
      (%stress-c-heap (* 200000 n) 3)(gc))


   n=0, [GC @ T+856.263463: 0.051111s, 3059847 cells]
   n=200000, [GC @ T+856.431754: 0.101368s, 3059843 cells]
   n=400000, [GC @ T+856.718321: 0.159537s, 3059843 cells]
   n=600000, [GC @ T+857.120419: 0.214145s, 3059843 cells]
   n=800000, [GC @ T+857.643623: 0.262312s, 3059843 cells]
   n=1000000, [GC @ T+858.264045: 0.311670s, 3059843 cells]
   n=1200000, [GC @ T+858.993757: 0.361081s, 3059843 cells]
   n=1400000, [GC @ T+859.838093: 0.413228s, 3059843 cells]
   n=1600000, [GC @ T+860.795151: 0.466172s, 3059843 cells]
   n=1800000, [GC @ T+861.876438: 0.517457s,

That looks pretty linear to me, with every 200,000 objects adding about 
50ms of time to the GC.

FWIW, when I did see the behavior I spoke of earlier, I was getting GC 
times in the 20-30 second range. I did the math and was seeing around an 
x800-900 difference between the Lisp heap and C heap cases.

-Mike
-- 
http://www.mschaef.com
From: André Thieme
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <dfansb$lik$1@ulric.tng.de>
Jon Harrop schrieb:
> Rob Thorpe wrote:
> 
>>Juho Snellman wrote:
>>
>>>No. The consing is coming from creating structures for the return
>>>values unitize, v+, v- and v*. Most of these structures are temporary,
>>>so a sufficently smart compiler could optimize them away. or at least
>>>allocate them on the stack. Unfortunately CMUCL and SBCL aren't smart
>>>enough.
>>
>>Ah, that's the problem, that makes sense.
> 
> 
> I'm not sure that is the whole picture though. In comparison, I don't
> believe ocamlopt will optimise the temporary allocation away and (with no
> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
> that inlining is one possible solution but a better solution is to have a
> GC that is much faster at allocating and immediately collecting
> temporaries. So my guess is that SBCL's GC needs some work.

Do you have an idea how much RAM would be needed to run the program 
(without producing errors) with the GC turned off?


André
-- 
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <43198fd3$0$17507$ed2e19e4@ptn-nntp-reader04.plus.net>
André Thieme wrote:
> Jon Harrop schrieb:
>> I'm not sure that is the whole picture though. In comparison, I don't
>> believe ocamlopt will optimise the temporary allocation away and (with no
>> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
>> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
>> that inlining is one possible solution but a better solution is to have a
>> GC that is much faster at allocating and immediately collecting
>> temporaries. So my guess is that SBCL's GC needs some work.
> 
> Do you have an idea how much ram would be needed to run the program
> (without producing errors) with the GC turned off?

OCaml does not have its GC turned off (its memory use is much less than that
of SBCL); it is simply more efficient at collecting temporaries.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1125759851.901293.254810@f14g2000cwb.googlegroups.com>
Jon Harrop wrote:
> André Thieme wrote:
> > Jon Harrop schrieb:
> >> I'm not sure that is the whole picture though. In comparison, I don't
> >> believe ocamlopt will optimise the temporary allocation away and (with no
> >> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
> >> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
> >> that inlining is one possible solution but a better solution is to have a
> >> GC that is much faster at allocating and immediately collecting
> >> temporaries. So my guess is that SBCL's GC needs some work.
> >
> > Do you have an idea how much ram would be needed to run the program
> > (without producing errors) with the GC turned off?
>
> OCaml does not have its GC turned off (its memory use is much less than that
> of the SBCL), it is simply more efficient at collecting temporaries.

Without knowing the amount of memory the OCaml version allocated and
recycled it's impossible to know this for certain.

The return values of the lisp code may be quite different to those of
the OCaml: they will store type information for a start, and may be
aligned in a different way.  It's also possible that things that would
require boxing in lisp do not in ML.
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431a31f3$0$22920$ed2619ec@ptn-nntp-reader01.plus.net>
Rob Thorpe wrote:
> Jon Harrop wrote:
>> OCaml does not have its GC turned off (its memory use is much less than
>> that of the SBCL), it is simply more efficient at collecting temporaries.
> 
> Without knowing the amount of memory the OCaml version allocated and
> recycled it's impossible to know this for certain.
> 
> The return values of the lisp code may be quite different to those of
> the OCaml, they will store typing for a start and may be aligned in a
> different way.  It's also possible that things that would require
> boxing in lisp do not in ML.

Very true. Here is the output of the OCaml GC for level=9, n=512 and ss=4:

minor_words: 3074514963
promoted_words: 3676056
major_words: 3676065
minor_collections: 93826
major_collections: 10
heap_words: 3072000
heap_chunks: 50
top_heap_words: 3072000
live_words: 2785889
live_blocks: 634798
free_words: 286096
free_blocks: 341
largest_free: 61440
fragments: 15
compactions: 0

If I am reading that correctly then minor_words*4 = ~12Gb was recycled on
x86. On AMD64 I get 17Gb.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126040970.724431.190750@g44g2000cwa.googlegroups.com>
Jon Harrop wrote:
> Rob Thorpe wrote:
> > Jon Harrop wrote:
> >> OCaml does not have its GC turned off (its memory use is much less than
> >> that of the SBCL), it is simply more efficient at collecting temporaries.
> >
> > Without knowing the amount of memory the OCaml version allocated and
> > recycled it's impossible to know this for certain.
> >
> > The return values of the lisp code may be quite different to those of
> > the OCaml, they will store typing for a start and may be aligned in a
> > different way.  It's also possible that things that would require
> > boxing in lisp do not in ML.
>
> Very true. Here is the output of the OCaml GC for level=9, n=512 and ss=4:
>
> minor_words: 3074514963
> promoted_words: 3676056
> major_words: 3676065
> minor_collections: 93826
> major_collections: 10
> heap_words: 3072000
> heap_chunks: 50
> top_heap_words: 3072000
> live_words: 2785889
> live_blocks: 634798
> free_words: 286096
> free_blocks: 341
> largest_free: 61440
> fragments: 15
> compactions: 0
>
> If I am reading that correctly then minor_words*4 = ~12Gb was recycled on
> x86. On AMD64 I get 17Gb.

That is a good GC.  I'm surprised that a GC can be that good.  The
problem with the GC in CMUCL and SBCL is that, in this case, it only
works well if run very infrequently, so the fastest times are had by
GCing rarely.  Ideally this shouldn't be the case, since GCing so
rarely puts a lot of strain on the memory subsystem.  It suggests the
mark phase is taking longer than in the ocamlopt GC.

(The reason I doubted it was the GC is that in lisps very few types of
object can contain free pointers, so GC is theoretically easier than in
many other languages.)
From: James McIlree
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126055548.956848.112790@g49g2000cwa.googlegroups.com>
Rob Thorpe wrote:
>
> That is a good GC.  I'm surprised that a GC can be that good.  The
> problem with the GC in CMUCL and SBCL is that it only works well in
> this case if used very infrequently.  So the fastest times are had by
> GCing very infrequently.  This ideally shouldn't be the case, since
> doing it like this puts a lot of strain on the memory subsystem.  It
> indicates it may be taking longer over the mark phase than the ocamlopt
> GC.
>
> (The reason I doubted it was the GC is that in lisps very few types of
> object can contain free pointers, so GC is theoretically easier than in
> many other languages.)

I'm making a WAG here, but I suspect that OCaml's GC isn't really doing
as much work as the other collectors. There are limits to how efficient
a basic "test & add" allocate and "trace the graph of live objects"
deallocate can be. For single threaded throughput performance, most
copying collectors are pretty close. The differences usually come in
policy decisions, how large are the various areas, aging objects or
not, and average arity of the object graph.

I'd guess that OCaml is able to take advantage of some property
in the language that allows its GC to skip some of the work the
other languages have to do. I would further guess that this isn't
a universal property; there are likely failure modes in which OCaml's
GC performs poorly as a result of such optimizations. At least, I
haven't seen papers claiming a new-found GC technique, which one
would expect if they'd made a major improvement.
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <87zmqp6qq1.fsf@qrnik.zagroda>
"James McIlree" <············@mac.com> writes:

> I'd guess that OCaml is able to take advantage of some property
> in the language that allows its GC to skip some of the work the
> other languages have to do.

I doubt this. OCaml's GC doesn't rely on specific properties of the
language, and the compiler doesn't do much high level optimizations.

For example the GC determines all types dynamically, and OCaml doesn't
try to allocate large objects with dynamic extent on the stack. Floats
are unboxed during local computations and when stored in arrays and in
records consisting solely of floats, but almost nothing else larger
than a machine word is unboxed.

Representation of values: an immediate constant with 1 in the least
significant bit (integers, characters, named constants), or a pointer
to a memory block whose header word contains the constructor number
(8 bits), marks for the GC (2 bits) and the length in words (rest).

There are two most important ranges of constructor numbers, depending
on whether the object consists of values to be scanned by GC, or of
bytes or floats not scanned by GC. There are no mixed representations
of objects consisting partly of scanned and non-scanned values, except
closures which have a code pointer before free variables.
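In C, that representation comes out roughly as follows (a sketch of
the description above, not the actual definitions from the OCaml
runtime's mlvalues.h, which may pack the header fields differently):

typedef unsigned long value;           /* one machine word */

#define Is_immediate(v)  ((v) & 1)     /* int, char, named constant */
#define Int_val(v)       ((long)(v) >> 1)
#define Val_int(n)       ((((value)(n)) << 1) | 1)

/* Heap blocks: the header word sits just before the data. */
#define Hd_val(v)        (((value *)(v))[-1])
#define Tag_hd(hd)       ((hd) & 0xFF)       /* constructor number: 8 bits */
#define Color_hd(hd)     (((hd) >> 8) & 3)   /* GC marks: 2 bits */
#define Wosize_hd(hd)    ((hd) >> 10)        /* length in words: the rest */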

OCaml uses the system stack. There is a central hash table which maps
possible return address to stack frame descriptors, used during GC to
find pointers on the stack.

There are two generations, young of fixed size and old. A minor
collection copies live young objects to the old heap, it's done while
the program is stopped. Major collection is incremental mark & sweep,
a chunk of it is performed while the program is stopped.

When compiled to native code, OCaml uses system threads to implement
OCaml threads, but doesn't allow multiple threads to simultaneously
access the runtime.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Marco Antoniotti
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <TNBTe.75$DJ5.76188@typhoon.nyu.edu>
Thanks for the interesting summary.

Do you know if OCaml uses any of the "region allocation" techniques that 
were worked out for some implementations of ML? (I have recollections of 
some papers by Tofte on the subject.)

Cheers
--
Marco





Marcin 'Qrczak' Kowalczyk wrote:
> "James McIlree" <············@mac.com> writes:
> 
>>I'd guess that OCaml is able to take advantage of some property
>>in the language that allows its GC to skip some of the work the
>>other languages have to do.
> 
> I doubt this. OCaml's GC doesn't rely on specific properties of the
> language, and the compiler doesn't do much high level optimizations.
> 
> [rest of the summary snipped]
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <87ll29m2aq.fsf@qrnik.zagroda>
Marco Antoniotti <·······@cs.nyu.edu> writes:

> Do you know if OCaml uses any of the "region allocation" techniques
> that were worked out for some implementation of ML (I have
> recollections of some papers by Tofte's on th esubject)

It doesn't.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Ulrich Hobelmann
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <3o8ek3F4p53vU1@individual.net>
Marco Antoniotti wrote:
> Thanks for the interesting summary.
> 
> Do you know if OCaml uses any of the "region allocation" techniques that 
> were worked out for some implementation of ML (I have recollections of 
> some papers by Tofte's on th esubject)

No, the only functional language (AFAIK) with region-based MM is the MLKit.

Maybe the most practical language with regions is Cyclone (an 
ML-inspired C dialect), but it also has GC for the "global region". 
Haven't ever used it, though...

-- 
My ideal for the future is to develop a filesystem remote interface
(a la Plan 9) and then have it implemented across the Internet as
the standard rather than HTML.  That would be ultimate cool.
	Ken Thompson
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431fb1e5$0$22918$ed2619ec@ptn-nntp-reader01.plus.net>
Ulrich Hobelmann wrote:
> No, the only functional language (AFAIK) with region-based MM is the
> MLKit.

Stalin also does some region analysis.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Ulrich Hobelmann
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <3oaj0lF51i2kU1@individual.net>
Jon Harrop wrote:
> Ulrich Hobelmann wrote:
>> No, the only functional language (AFAIK) with region-based MM is the
>> MLKit.
> 
> Stalin also does some region analysis.

Hm, that's interesting, because Stalin is all about performance.  One 
problem of the MLKit approach was that all the opening and closing 
regions made it somewhat slower than, say, SML/NJ.

-- 
My ideal for the future is to develop a filesystem remote interface
(a la Plan 9) and then have it implemented across the Internet as
the standard rather than HTML.  That would be ultimate cool.
	Ken Thompson
From: Jens Axel Søgaard
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <43203aaa$0$37022$edfadb0f@dread12.news.tele.dk>
Ulrich Hobelmann wrote:
> Jon Harrop wrote:
> 
>> Ulrich Hobelmann wrote:
>>
>>> No, the only functional language (AFAIK) with region-based MM is the
>>> MLKit.
>>
>>
>> Stalin also does some region analysis.
> 
> 
> Hm, that's interesting, because Stalin is all about performance.  One 
> problem of the MLKit approach was that all the opening and closing 
> regions made it somewhat slower than, say, SML/NJ.
> 

If you are interested Siskind has written a paper on (part of)
the techniques used:

   "Flow-Directed Lightweight Closure Conversion".
   Jeffrey Mark Siskind.
   NEC Research Institute, Inc.. Technical Report 99-190R. December 1999.
   <ftp://ftp.ecn.purdue.edu/qobi/fdlcc.pdf>

It ain't for the faint of heart though.

-- 
Jens Axel Søgaard
From: Thomas F. Burdick
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <xcvslwgln6h.fsf@conquest.OCF.Berkeley.EDU>
Marcin 'Qrczak' Kowalczyk <······@knm.org.pl> writes:

> "James McIlree" <············@mac.com> writes:
> 
> > I'd guess that OCaml is able to take advantage of some property
> > in the language that allows its GC to skip some of the work the
> > other languages have to do.
> 
> I doubt this. OCaml's GC doesn't rely on specific properties of the
> language, and the compiler doesn't do much high level optimizations.

Are you sure?  In the context of Lisp, I'd count being purely
functional as a specific property of the language.  I'm doubtful that
the OCaml GC bothers to consider the possibility that old objects
could be pointing to newer ones.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | Free Mumia Abu-Jamal! |
     ,--'    _,'   | Abolish the racist    |
    /       /      | death penalty!        |
   (   -.  |       `-----------------------'
   |     ) |                               
  (`-.  '--.)                              
   `. )----'                               
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126172627.862133.46860@f14g2000cwb.googlegroups.com>
Thomas F. Burdick wrote:
> Marcin 'Qrczak' Kowalczyk <······@knm.org.pl> writes:
>
> > "James McIlree" <············@mac.com> writes:
> >
> > > I'd guess that OCaml is able to take advantage of some property
> > > in the language that allows its GC to skip some of the work the
> > > other languages have to do.
> >
> > I doubt this. OCaml's GC doesn't rely on specific properties of the
> > language, and the compiler doesn't do much high level optimizations.
>
> Are you sure?  In the context of Lisp, I'd count being purely
> functional as a specific property of the language.  I'm doubtful that
> the OCaml GC bothers to consider the possibility that old objects
> could be pointing to newer ones.

I don't think OCaml is purely functional.  It may be compiled to a
purely functional form by the time it reaches machine code, though;
I don't know.
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <87fysfkrt4.fsf@qrnik.zagroda>
···@conquest.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

>> I doubt this. OCaml's GC doesn't rely on specific properties of the
>> language, and the compiler doesn't do much high level optimizations.
>
> Are you sure?  In the context of Lisp, I'd count being purely
> functional as a specific property of the language.  I'm doubtful that
> the OCaml GC bothers to consider the possibility that old objects
> could be pointing to newer ones.

OCaml is not purely functional. Its runtime has a software write
barrier which changes GC marks and pushes the changed location to
an array if needed, and visits these locations during minor GC.

(Some ancient version, perhaps Caml Light, had a different GC which
treated mutable and immutable objects differently, and duplicated
immutable objects in some conditions. I've once seen a paper about it.
But this doesn't apply to current OCaml.)

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126130725.926305.5020@g44g2000cwa.googlegroups.com>
Marcin 'Qrczak' Kowalczyk wrote:
> "James McIlree" <············@mac.com> writes:
>
> > I'd guess that OCaml is able to take advantage of some property
> > in the language that allows its GC to skip some of the work the
> > other languages have to do.
>
> I doubt this. OCaml's GC doesn't rely on specific properties of the
> language, and the compiler doesn't do much high level optimizations.
>
> [detailed description of the runtime snipped -- see Marcin's post above]

Since you know rather a lot about the OCaml garbage collector: do you
know why it's good at this particular benchmark?

It doesn't sound all that different to many GCs to me.
From: Marco Antoniotti
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <8_JTe.77$DJ5.76154@typhoon.nyu.edu>
Rob Thorpe wrote:

> 
> 
> Since you know rather a lot about the OCaml garbage collector: do you
> know why it's good at this particualar benchmark?
> 
> It doesn't sound all that different to many GCs to me.
> 

Without knowing much of the specifics,
I'd bet that the OCaml compiler is just very good at stack allocating 
several intermediate values, especially vectors.
I.e. it does DYNAMIC-EXTENT very well.

Cheers
--
marco
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <431f984e$0$1307$ed2619ec@ptn-nntp-reader02.plus.net>
Marco Antoniotti wrote:
> Rob Thorpe wrote:
> Without knowning much of the specifics,
> I'd bet that the OCaml compiler is just very good at stack allocating
> several intermediate values, especially vectors.
> I.e. it does DYNAMIC-EXTENT very well.

The vectors are all heap allocated, on the young heap where possible.

Have a look at the code around "caml_young_ptr" in the assembler generated
by ocamlopt for the "ray_sphere" function on x86, for example:

camlRay2__ray_sphere_84:
        subl    $24, %esp
.L166:
        movl    %eax, %esi
.L167:  movl    caml_young_ptr, %eax
        subl    $28, %eax
        movl    %eax, caml_young_ptr
        cmpl    caml_young_limit, %eax
        jb      .L168
        leal    4(%eax), %eax
        movl    $6398, -4(%eax)
        fldl    (%ecx)
        fsubl   (%esi)
        fstpl   (%eax)
        fldl    8(%ecx)
        fsubl   8(%esi)
        fstpl   8(%eax)
        fldl    16(%ecx)
        fsubl   16(%esi)
        fstpl   16(%eax)
        fldl    8(%eax)
        fmull   8(%ebx)
        fldl    (%eax)
        fmull   (%ebx)
        faddp   %st, %st(1)
        fldl    16(%eax)
        fmull   16(%ebx)
        faddp   %st, %st(1)
        fstpl   16(%esp)
        fldl    8(%eax)
        fmull   8(%eax)
        fldl    (%eax)
        fmull   (%eax)
        faddp   %st, %st(1)
        fldl    16(%eax)
        fmull   16(%eax)
        faddp   %st, %st(1)
        fldl    16(%esp)
        fmull   16(%esp)
        fsubp   %st, %st(1)
        fldl    (%edx)
        fmull   (%edx)
        faddp   %st, %st(1)
        fstpl   0(%esp)
        fldz
        fcompl  0(%esp)
        fnstsw  %ax
        andb    $69, %ah
        jne     .L165
        movl    camlPervasives + 36, %eax
        addl    $24, %esp
        ret
        .align  16
.L165:
        fldl    0(%esp)
        fsqrt
        fstpl   8(%esp)
        fldl    16(%esp)
        faddl   8(%esp)
        fstpl   0(%esp)
.L170:  movl    caml_young_ptr, %eax
        subl    $12, %eax
        movl    %eax, caml_young_ptr
        cmpl    caml_young_limit, %eax
        jb      .L171
        leal    4(%eax), %ecx
        movl    $2301, -4(%ecx)
        fldl    0(%esp)
        fstpl   (%ecx)
        fldz
        fcompl  0(%esp)
        fnstsw  %ax
        andb    $69, %ah
        jne     .L164
        movl    camlPervasives + 36, %eax
        addl    $24, %esp
        ret
        .align  16
.L164:
        fldl    16(%esp)
        fsubl   8(%esp)
        fstpl   0(%esp)
.L173:  movl    caml_young_ptr, %eax
        subl    $12, %eax
        movl    %eax, caml_young_ptr
        cmpl    caml_young_limit, %eax
        jb      .L174
        leal    4(%eax), %ebx
        movl    $2301, -4(%ebx)
        fldl    0(%esp)
        fstpl   (%ebx)
        fldz
        fcompl  0(%esp)
        fnstsw  %ax
        andb    $69, %ah
        cmpb    $1, %ah
        jne     .L163
        movl    %ebx, %eax
        addl    $24, %esp
        ret
        .align  16
.L163:
        movl    %ecx, %eax
        addl    $24, %esp
        ret
.L174:  call    caml_call_gc
.L175:  jmp     .L173
.L171:  call    caml_call_gc
.L172:  jmp     .L170
.L168:  call    caml_call_gc
.L169:  jmp     .L167

Here's the equivalent generated by gcc from a C implementation for
comparison:

ray_sphere:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $32, %esp
        movl    8(%ebp), %eax
        movl    12(%ebp), %edx
        fldl    8(%eax)
        fsubrl  8(%edx)
        fldl    (%eax)
        fsubrl  (%edx)
        fldl    16(%eax)
        fsubrl  16(%edx)
        fstl    -8(%ebp)
        fxch    %st(2)
        fstl    -16(%ebp)
        fxch    %st(1)
        fstl    -24(%ebp)
        fld     %st(0)
        fmull   24(%eax)
        fld     %st(2)
        fmull   32(%eax)
        faddp   %st, %st(1)
        fld     %st(3)
        fmull   40(%eax)
        faddp   %st, %st(1)
        fld     %st(0)
        fmul    %st(1), %st
        fxch    %st(2)
        fmul    %st(0), %st
        fxch    %st(3)
        fmul    %st(0), %st
        faddp   %st, %st(3)
        fxch    %st(3)
        fmul    %st(0), %st
        faddp   %st, %st(2)
        fsubp   %st, %st(1)
        fldl    24(%edx)
        fmul    %st(0), %st
        faddp   %st, %st(1)
        ftst
        fnstsw  %ax
        testb   $5, %ah
        jne     .L44
        fsqrt
        fld     %st(1)
        fadd    %st(1), %st
        ftst
        fnstsw  %ax
        testb   $5, %ah
        jne     .L43
        fxch    %st(1)
        fsubrp  %st, %st(2)
        fxch    %st(1)
        ftst
        fnstsw  %ax
        testb   $69, %ah
        jne     .L41
        fstp    %st(1)
        leave
        ret
.L43:
        fstp    %st(0)
        .p2align 4,,15
.L44:
        fstp    %st(0)
        fstp    %st(0)
        flds    .LC8
        leave
        ret
.L41:
        fstp    %st(0)
        leave
        ret

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Marco Antoniotti
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <29XTe.79$DJ5.76288@typhoon.nyu.edu>
I'm too lazy to actually figure out what the assembly corresponds to.  :)

What part of the OCaml code actually generates the assembly?

Cheers
--
Marco



Jon Harrop wrote:
> Marco Antoniotti wrote:
> 
>>Rob Thorpe wrote:
>>Without knowning much of the specifics,
>>I'd bet that the OCaml compiler is just very good at stack allocating
>>several intermediate values, especially vectors.
>>I.e. it does DYNAMIC-EXTENT very well.
> 
> 
> The vectors are all heap allocated, on the young heap where possible.
> 
> Have a look at the code around "caml_young_ptr" in the assembler generated
> by ocamlopt for the "ray_sphere" function on x86, for example:
> 
> camlRay2__ray_sphere_84:
>         [assembler listing snipped]
> 
> Here's the equivalent generated by gcc from a C implementation for
> comparison:
> 
> ray_sphere:
>         [assembler listing snipped]
> 
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126195773.673121.245330@z14g2000cwz.googlegroups.com>
Marco Antoniotti wrote:
> I'm too lazy to actually figure out what the assmbly corresponds to.  :)
>
> What part of the OCaml code actually generares the assembly?

I don't think the precise machine language used is all that relevant.
The original point was that the ocamlopt version of the raytracer
conses just as much as the SBCL/CMUCL version does - see my discussion
with Jon above.

But the ocamlopt program doesn't need its GC set to run only every
100MB in order to perform well; it works OK GCing much more often.
How does it manage this?

(The root set of the ocamlopt program will be much smaller than that of
the entire lisp, which may have something to do with it)
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4320e43b$0$22934$ed2619ec@ptn-nntp-reader01.plus.net>
Rob Thorpe wrote:
> But the Ocamlopt program doesn't need GC setting to every 100MB in
> order to perform well, it works OK GCing much more often.  How does it
> manage this?

It does roughly 10,000x as many minor collections as major ones (93826
minor vs 10 major in the stats I posted).

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <878xy7kqh4.fsf@qrnik.zagroda>
"Rob Thorpe" <·············@antenova.com> writes:

> Since you know rather a lot about the OCaml garbage collector:
> do you know why it's good at this particualar benchmark?

I don't know. The runtime is quite straightforward. It doesn't even
do much inlining, and there are some awful cases where it does silly
work: array operations which remain polymorphic perform a check for
float arrays at runtime, and this check is not hoisted out of loops.
Copying & pasting standard array functions and specializing them to
a concrete element type generates faster code than using the standard
library functions (the check is optimized out if the element type is
known statically).

I guess the language maps quite well to current processors: statically
typed; integer arithmetic silently wraps around; the system stack is
used, without frame pointers - just temporaries and return addresses;
function calling convention passes arguments in registers and thus
is faster than the default C convention; tail calls are performed as
jumps.


Marco Antoniotti <·······@cs.nyu.edu> writes:

> Without knowing much of the specifics,
> I'd bet that the OCaml compiler is just very good at stack allocating
> several intermediate values, especially vectors.
> I.e. it does DYNAMIC-EXTENT very well.

It doesn't do that at all. It unboxes only floats and perhaps int32
and int64, and does this only for local computation - the function
call protocol passes all arguments using the generic pointer-sized
value representation with floats boxed.

GC is fast for lots of shortly lived garbage. Object allocation
usually takes 5 instructions, with adjacent allocations coalesced.
If an object dies before the next GC, the runtime spends zero
additional time freeing it.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126272810.195868.185990@z14g2000cwz.googlegroups.com>
Marcin 'Qrczak' Kowalczyk wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
>
> > Since you know rather a lot about the OCaml garbage collector:
> > do you know why it's good at this particular benchmark?
>
> I don't know. The runtime is quite straightforward. It doesn't even
> do much inlining, and there are some awful cases where it does silly
> work: array operations which remain polymorphic perform a check for
> float arrays at runtime, and this check is not hoisted out of loops.
> Copying & pasting standard array functions and specializing them to
> a concrete element type generates faster code than using the standard
> library functions (the check is optimized out if the type of elements
> is known statically).
>
> I guess the language maps quite well to current processors: statically
> typed; integer arithmetic silently wraps around; the system stack is
> used, without frame pointers - just temporaries and return addresses;
> function calling convention passes arguments in registers and thus
> is faster than the default C convention; tail calls are performed as
> jumps.

I see, that sounds like a good way to build a fast language.

> Marco Antoniotti <·······@cs.nyu.edu> writes:
>
> > Without knowing much of the specifics,
> > I'd bet that the OCaml compiler is just very good at stack allocating
> > several intermediate values, especially vectors.
> > I.e. it does DYNAMIC-EXTENT very well.
>
> It doesn't do that at all. It unboxes only floats and perhaps int32
> and int64, and does this only for local computation - the function
> call protocol passes all arguments using the generic pointer-sized
> value representation with floats boxed.
>
> GC is fast for lots of shortly lived garbage. Object allocation
> usually takes 5 instructions, with adjacent allocations coalesced.
> If an object dies before the next GC, the runtime spends zero
> additional time freeing it.

So, I assume it works something like this:
There is a head pointer marking the gap between used-space and
free-space in the heap.  The GC finds no pointers into the most recent
bit allocated, so it just winds the head pointer back to the last bit
that's still visible.

Is that right?
From: Marco Antoniotti
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <6UfUe.2$pa3.186@typhoon.nyu.edu>
Marcin 'Qrczak' Kowalczyk wrote:
> 
> Marco Antoniotti <·······@cs.nyu.edu> writes:
> 
> 
>>Without knowing much of the specifics,
>>I'd bet that the OCaml compiler is just very good at stack allocating
>>several intermediate values, especially vectors.
>>I.e. it does DYNAMIC-EXTENT very well.
> 
> 
> It doesn't do that at all. It unboxes only floats and perhaps int32
> and int64, and does this only for local computation - the function
> call protocol passes all arguments using the generic pointer-sized
> value representation with floats boxed.
> 
> GC is fast for lots of shortly lived garbage. Object allocation
> usually takes 5 instructions, with adjacent allocations coalesced.
> If an object dies before the next GC, the runtime spends zero
> additional time freeing it.


Ok.  But how does it work?  (I am curious and I don't know.)  From what
you are saying, a lot of the intermediate vectors that are created in the
ray tracer "die before the next GC".  How is this accomplished?

Cheers
--
Marco
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <87psrinmu1.fsf@qrnik.zagroda>
Marco Antoniotti <·······@cs.nyu.edu> writes:

> Ok.  But how does it work?  (I am curious and I don't know.)  From what
> you are saying, a lot of the intermediate vectors that are created in
> the ray tracer "die before the next GC".  How is this accomplished?

I haven't looked at this particular example, so I don't know the
distribution of the lifetimes of its objects. I only more or less know
how the OCaml runtime works (a rough idea based on looking at the
source; hopefully I'm not giving incorrect information here) and what
its GC is good at.

The heap is divided into two parts: the young heap (32k words by
default) and the old heap (several disjoint blocks). Objects not larger
than 256 words are allocated from the young heap; larger objects are
allocated from the old heap.

Allocation of a small object works like this: move the heap pointer
down (allocation starts from the end); if it has fallen below the
limit, move it back up, perform a GC, and move it down again.
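
In C, that fast path is roughly the following sketch (names invented;
the real runtime open-codes it and keeps the allocation pointer in a
register):

#include <stddef.h>

extern char *young_ptr;        /* current allocation point, moves down */
extern char *young_limit;      /* bottom of the young heap */
void minor_collection(void);

void *alloc_small(size_t size)
{
    young_ptr -= size;              /* move the heap pointer down */
    if (young_ptr < young_limit) {  /* fallen below the limit? */
        young_ptr += size;          /* move it back up, */
        minor_collection();         /* perform a GC, */
        young_ptr -= size;          /* and move it down again */
    }
    return young_ptr;               /* header word goes here, fields after */
}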

A GC first empties the young heap by moving all live objects from it
into the old heap, starting from the stack, global variables, and
locations mutated since the last GC, and tracing objects through
pointers. Then it performs a slice of the major GC, an incremental
mark & sweep of the old heap. Both the mark phase and the sweep phase
are performed incrementally.
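
Emptying the young heap is essentially a small copying collection.  A
much-simplified C sketch (names invented; the real collector also scans
the fields of the objects it has just promoted, transitively):

#include <stddef.h>
#include <string.h>

typedef struct block {
    struct block *forward;     /* NULL until the block has been copied */
    size_t size;               /* total size in bytes, fields follow */
} block;

extern block **roots[];        /* stack slots, globals, mutated locations */
extern size_t nroots;
extern char *young_ptr, *young_end;

void *old_alloc(size_t size);  /* allocate in the old heap */
int is_young(const block *b);

static block *promote(block *b)
{
    if (!is_young(b)) return b;          /* old objects stay where they are */
    if (b->forward) return b->forward;   /* already moved during this GC */
    block *copy = old_alloc(b->size);    /* move it into the old heap */
    memcpy(copy, b, b->size);
    b->forward = copy;                   /* leave a forwarding pointer */
    return copy;
}

void minor_collection(void)
{
    for (size_t i = 0; i < nroots; i++)
        *roots[i] = promote(*roots[i]);  /* touch only live young objects */
    young_ptr = young_end;               /* everything left is dead and is
                                            reclaimed in one step */
}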

The write barrier is needed if the modified location is inside an old
object and points to a young object.
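
Schematically (again with invented names):

typedef long value;

int in_old_heap(const value *location);
int is_pointer(value v);
int points_into_young_heap(value v);
void remember(value *location);   /* record for the next minor GC */

void set_field(value *location, value v)
{
    *location = v;
    if (in_old_heap(location) && is_pointer(v) && points_into_young_heap(v))
        remember(location);       /* an old->young store: a new GC root */
}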

Since only live young objects are touched during emptying the young
heap, allocating lots of short-lived objects is cheap.


There are aspects that I see in the code but don't understand:

Two bits per object are somehow used to keep the GC from losing track
of which objects have already been processed when the program mutates
the heap between collections.

There is some tricky computation which determines how much of the
mark & sweep work needs to be done in the current slice, so that the
collection is fast enough that the program doesn't keep too much
memory allocated, and slow enough that the program doesn't spend too
much time in GC.

There is a very rarely triggered compaction of the old heap, performed
in place and in one slice.

In some circumstances unknown to me there are pointers pointing into
the middle of objects, and words which look like integers when viewed
as fields of another object but also play the role of object headers.


"Rob Thorpe" <·············@antenova.com> writes:

> So, I assume it works something like this:
> There is a head pointer marking the gap between used-space and
> free-space in the heap.  The GC finds no pointers into the most recent
> bit allocated, so it just winds the head pointer back to the last bit
> that's still visible.

No, there is nothing like this.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Marco Antoniotti
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <HfkUe.4$pa3.425@typhoon.nyu.edu>
Thank you.  Very good summary.

Cheers
--
Marco



From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4321f710$0$1278$ed2619ec@ptn-nntp-reader02.plus.net>
Marcin 'Qrczak' Kowalczyk wrote:
> Two bits per object are somehow used to keep the GC from losing track
> of which objects have already been processed when the program mutates
> the heap between collections.

Is 1 bit mark and the other tagged int vs pointer?

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Sylvain
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <0LadnZ2dnZ2VFMT-nZ2dnWPYv96dnZ2dRVn-yZ2dnZ0@speakeasy.net>
Jon Harrop wrote:
> Is 1 bit mark and the other tagged int vs pointer?
> 

You need more than one bit for marking an object in an incremental
graph-tracing garbage collector.  (Incremental GC is garbage collection
proceeding while the application, i.e., the 'mutator', keeps doing its
thing; otherwise you'd have to resort to a 'stop the world' approach,
which actually is not as bad as it sounds in real life, but I digress.)

The idea is that the GC initially marks the root objects grey; only
grey objects are subsequently scanned.  When you scan a grey object,
you mark the white objects it points to grey.  A grey object whose
children are all grey or black is itself marked black.  Eventually all
grey objects become black, and then the white objects are known to be
garbage.

For the thing to work, no black object can hold a pointer denoting a
white object directly (because a black object won't be scanned again).
To achieve that you can use either a read barrier or a write barrier:
the former traps attempts by the mutator to dereference a pointer
denoting a white object; the latter traps attempts to copy a pointer
denoting a white object into a black cell.  Write barriers are usually
preferable, because applications tend to write pointers less often than
they read them, and an effective write barrier can be implemented using
hardware tricks (via virtual memory page protection for instance; some
processors even support it directly).
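
The central loop of the mark phase, as a rough C sketch (all names
invented; a real incremental collector does this work in slices,
interleaved with the mutator, which is exactly why it needs the barrier
described above):

typedef enum { WHITE, GREY, BLACK } colour;

typedef struct obj {
    colour mark;
    int nfields;
    struct obj **fields;           /* pointers to other objects */
} obj;

/* grey_stack initially holds the roots, all marked grey; capacity
   checking is omitted because this is only a sketch */
void mark_phase(obj **grey_stack, int n)
{
    while (n > 0) {
        obj *o = grey_stack[--n];             /* scan one grey object */
        for (int i = 0; i < o->nfields; i++) {
            obj *child = o->fields[i];
            if (child != NULL && child->mark == WHITE) {
                child->mark = GREY;           /* shade what it points to */
                grey_stack[n++] = child;      /* to be scanned later */
            }
        }
        o->mark = BLACK;      /* all children shaded, so blacken it */
    }
    /* anything still WHITE is now known to be garbage */
}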

well, I used to be a garbage guy in the ol' days, sorry for the
rambling :-)

--Sylvain
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4322dde6$0$17468$ed2e19e4@ptn-nntp-reader04.plus.net>
Sylvain wrote:
> well, I used to be a garbage guy in the ol' days, sorry for the
> rambling :-)

Not at all, thanks for the explanation. :-)

This begs the question: how come ints have 1 spare bit?

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Sylvain
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <VKSdnfEFef7dq77eRVn-tw@speakeasy.net>
Jon Harrop wrote:
> Sylvain wrote:
> 
>>well, I used to be a garbage guy in the ol' days, sorry for the
>>rambling :-)
> 
> 
> Not at all, thanks for the explanation. :-)
> 
> This begs the question: how come ints have 1 spare bit?

I am new to lisp (am still going through the reading list that I
picked up on this newsgroup and doing very basic stuff) and I am not
familiar with Lisp runtimes (I am more familiar with the gory details
of Java VMs), but I'll venture a wild guess anyway :-) : it looks to
me like a tag.  One difficulty with implementing an efficient GC is
identifying pointers unambiguously.  There are fancy techniques like
building stack maps that can be used to remove such ambiguities, or
using heuristics (but that becomes tricky if pointers are allowed to
point inside objects); but if you can spare a bit for tagging things
that might look like a pointer, it is certainly simpler (actually
tagging can be supported in hardware).  Misidentifying a pointer as an
integer can lead you to collect live objects, while the opposite can
lead to really interesting bugs if you are using some kind of copying
GC, e.g., generational or copy-compact, whereby the GC can change
pointers denoting objects that are moved.
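
For example, a one-bit tagging scheme might look like this in C (just a
sketch with invented names: heap pointers are word-aligned so their low
bit is always 0, and an integer n is stored as 2n+1 - which is also
where the ints' "spare bit" goes):

#include <stdint.h>

typedef intptr_t value;    /* one tagged machine word */

#define IS_INT(v)      (((v) & 1) != 0)   /* low bit set: an integer */
#define IS_POINTER(v)  (((v) & 1) == 0)   /* word-aligned, low bit clear */
#define INT_TO_VAL(n)  ((((value)(n)) << 1) | 1)
#define VAL_TO_INT(v)  ((v) >> 1)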

--Sylvain
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126388895.431526.30490@g44g2000cwa.googlegroups.com>
Sylvain wrote:
> Jon Harrop wrote:
> > Sylvain wrote:
> >
> >>well,  I use to be a garbage guy in the ol' days, sorry
> >>for the rambling :-)
> >
> >
> > Not at all, thanks for the explanation. :-)
> >
> > This begs the question: how come ints have 1 spare bit?
>
> I am new to lisp (am still going through the reading list that I
> picked up on this newsgroup and doing very basic stuff) and I am not
> familiar with Lisp runtimes (I am more familiar with the gory details
> of Java VMs)

Whoa there.  Are you talking about why a lisp GC should use more than
one bit or why the OCaml GC does it?
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <87y865sq1z.fsf@qrnik.zagroda>
Jon Harrop <······@jdh30.plus.com> writes:

>> Two bits per object are somehow used to keep the GC from losing track
>> of which objects have already been processed when the program mutates
>> the heap between collections.
>
> Is 1 bit mark and the other tagged int vs pointer?

No. A value is either an int or a pointer, distinguished by 1 bit.
If it's a pointer, it points to an object with a one-word header and
some fields or bytes. The header includes 2 bits reserved for GC.
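
Schematically, in C (the exact field widths are from my memory of the
source, so treat them as illustrative):

#include <stdint.h>

typedef uintptr_t header;    /* the one-word header */

#define TAG(h)      ((h) & 0xff)       /* what kind of block it is */
#define GC_BITS(h)  (((h) >> 8) & 3)   /* the 2 bits reserved for GC */
#define WOSIZE(h)   ((h) >> 10)        /* size of the block in words */

#define MAKE_HEADER(wosize, gcbits, tag) \
    (((header)(wosize) << 10) | ((header)(gcbits) << 8) | (header)(tag))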

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4321fc83$0$1278$ed2619ec@ptn-nntp-reader02.plus.net>
James McIlree wrote:
> I'd guess that OCaml is able to take advantage of some property
> in the language that allows its GC to skip some of the work the
> other languages have to do.

OCaml simply has a more efficient GC. MLton and Stalin can be much faster
than OCaml (and vastly faster than Lisp) quite possibly because they use
more sophisticated techniques.

> I would further guess that this isn't 
> a universal property, there are likely failure modes in which OCaml's
> GC performs poorly as a result of such optimizations.

There is no actual evidence of that. Indeed, there isn't even a sound
theoretical reason to think that.

> At least,  I 
> haven't seen papers claiming a new found GC technique, which one
> would expect if they'd made a major improvement.

Several people seem to be assuming that OCaml is the outlier here. It is
not. Of the functional languages that I have seen, SBCL is unusually slow.
OCaml is in the middle and MLton and Stalin are at the front. So you should
be asking why SBCL is the outlier, rather than looking for (and not
finding) evidence of why OCaml is "unusually fast".

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: James McIlree
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126317218.833926.130260@g44g2000cwa.googlegroups.com>
Jon Harrop wrote:

> OCaml simply has a more efficient GC. MLton and Stalin can be much faster
> than OCaml (and vastly faster than Lisp) quite possibly because they use
> more sophisticated techniques.

Utter bunk. You draw that conclusion from one test, without even an
understanding of why you got the results you did?

As a "scientist", you're supposed to perform tests and let the results
guide your opinion.

As an "evangelist", you find the results that support the opinion you
want.

Which are you?
From: Jon Harrop
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <4322d88e$0$17468$ed2e19e4@ptn-nntp-reader04.plus.net>
James McIlree wrote:
> Jon Harrop wrote:
>> OCaml simply has a more efficient GC. MLton and Stalin can be much faster
>> than OCaml (and vastly faster than Lisp) quite possibly because they use
>> more sophisticated techniques.
> 
> Utter bunk. You draw that conclusion from one test,

As there are four variations on the ray tracer, you could count it as more
than one test.

> without even an understanding of why you got the results you did?

I draw that conclusion from all the tests I've seen. Mostly the
shootout: OCaml is faster (Full CPU time) than SBCL on 8/9 tests,
Stalin beats SBCL on 1/1, and MLton on 10/10.

Of those 20 comparisons, the one shootout test that SBCL is (12%) faster
than OCaml on, nsieve, gets quite the opposite result on my computer when
the benchmark takes a reasonable amount of time to run:

$ time ./sieve 12
Primes up to 40960000  2488465

real    0m9.521s
user    0m8.890s
sys     0m0.040s

* (time (test 12))
Primes up to 40960000  2488465
Evaluation took:
                 13.393 seconds of real time
                 12.75 seconds of user run time
                 0.09 seconds of system run time
                 0 page faults and
                 5,120,008 bytes consed.

Note that longer benchmarks are less biased against SBCL because it has a
15x slower startup time than most other languages.

However, this test isn't consing values anything like as quickly as the ray
tracer (<1Mb/sec here compared to >300Mb/sec in the ray tracer), so GC
isn't to blame. It looks as though OCaml's code gen is much better, at
least for AMD hardware. Several other people reported that SBCL is much
more competitive at running my ray tracer on Intel hardware.

> As a "scientist", you're supposed to performs tests and let the results
> guide your opinion.
>
> As an "evangalist" you find the results that support the opinion you
> want.
>
> Which are you?

How much evidence do you want? That's over 10 tests and >95% of the time
SBCL is slower, often much slower, than the other FPLs.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <64td4qzz.fsf@alum.mit.edu>
"Rob Thorpe" <·············@antenova.com> writes:

> Jon Harrop wrote:
>>
>> If I am reading that correctly then minor_words*4 = 10Gb was recycled on
>> x86. On AMD64 I get 17Gb.
>
> That is a good GC.  I'm surprised that a GC can be that good.  The
> problem with the GC in CMUCL and SBCL is that it only works well in
> this case if used very infrequently.  So the fastest times are had by
> GCing very infrequently.  This ideally shouldn't be the case, since
> doing it like this puts a lot of strain on the memory subsystem.  It
> indicates it may be taking longer over the mark phase than the ocamlopt
> GC.

What do you mean by `strain on the memory subsystem'?  I picture the
RAM chips starting to bulge because of all the bits being stuffed in
them.

In a generational GC, it can be a good idea to delay collecting in
order for objects to `age' enough to become collectable.  If the GC is
a copying GC, then the flip is mostly proportional to the amount of
live storage, so you can get maximum performance by using as much RAM
as you possibly can (it isn't like you are saving it for something)
and so delay the flip as long as you can without spilling out to the
disk.  When the flip eventually does occur, the vast majority of the
data in RAM is unused and therefore untraced and the flip is nearly
instantaneous.

> (The reason I doubted it was the GC is that in lisps very few types of
> object can contain free pointers, so GC is theoretically easier than in
> many other languages.)

Huh?  Nearly any object created with defstruct or defclass has pointers.

~jrm
From: mikel
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <NgrTe.5841$mn6.13700214@news.sisna.com>
Joe Marshall wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
> 
> 
>>Jon Harrop wrote:
>>
>>>If I am reading that correctly then minor_words*4 = 10Gb was recycled on
>>>x86. On AMD64 I get 17Gb.
>>
>>That is a good GC.  I'm surprised that a GC can be that good.  The
>>problem with the GC in CMUCL and SBCL is that it only works well in
>>this case if used very infrequently.  So the fastest times are had by
>>GCing very infrequently.  This ideally shouldn't be the case, since
>>doing it like this puts a lot of strain on the memory subsystem.  It
>>indicates it may be taking longer over the mark phase than the ocamlopt
>>GC.
> 
> 
> What do you mean by `strain on the memory subsystem'?  I picture the
> RAM chips starting to bulge because of all the bits being stuffed in
> them.

Don't forget the clogging caused by the sooty accumulation of hot 
exhaust bits.
From: Greg Menke
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <m3d5nlis6u.fsf@athena.pienet>
mikel <·····@evins.net> writes:

> Joe Marshall wrote:
> > "Rob Thorpe" <·············@antenova.com> writes:
> >
> >>Jon Harrop wrote:
> >>
> >>>If I am reading that correctly then minor_words*4 = 10Gb was recycled on
> >>>x86. On AMD64 I get 17Gb.
> >>
> >>That is a good GC.  I'm surprised that a GC can be that good.  The
> >>problem with the GC in CMUCL and SBCL is that it only works well in
> >>this case if used very infrequently.  So the fastest times are had by
> >>GCing very infrequently.  This ideally shouldn't be the case, since
> >>doing it like this puts a lot of strain on the memory subsystem.  It
> >>indicates it may be taking longer over the mark phase than the ocamlopt
> >>GC.
> > What do you mean by `strain on the memory subsystem'?  I picture the
> > RAM chips starting to bulge because of all the bits being stuffed in
> > them.
> 
> Don't forget the clogging caused by the sooty accumulation of hot
> exhaust bits.

I ran a Lisp program once that strained the memory subsystem till all
the 0 bits burst - bit juice all inside the computer...  Eeew.

Gregm
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126087446.869490.138520@g49g2000cwa.googlegroups.com>
Joe Marshall wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
>
> > Jon Harrop wrote:
> >>
> >> If I am reading that correctly then minor_words*4 = 10Gb was recycled on
> >> x86. On AMD64 I get 17Gb.
> >
> > That is a good GC.  I'm surprised that a GC can be that good.  The
> > problem with the GC in CMUCL and SBCL is that it only works well in
> > this case if used very infrequently.  So the fastest times are had by
> > GCing very infrequently.  This ideally shouldn't be the case, since
> > doing it like this puts a lot of strain on the memory subsystem.  It
> > indicates it may be taking longer over the mark phase than the ocamlopt
> > GC.
>
> What do you mean by `strain on the memory subsystem'?  I picture the
> RAM chips starting to bulge because of all the bits being stuffed in
> them.

What I mean is that going through 100Mb of memory or so is quite a
significant operation.  Caches these days are not that large compared
to main memory size, I think my Athlon has a 256kB L2 data cache and
128kB L1 data cache.  If the GC goes through a similarly large area of
memory, it will thrash the cache significantly while doing it, and
afterward much of the cache's contents will be useless.

(At this point, if people like, they can think of the RAM chips being
whipped.)

> In a generational GC, it can be a good idea to delay collecting in
> order for objects to `age' enough to become collectable.  If the GC is
> a copying GC, then the flip is mostly proportional to the amount of
> live storage, so you can get maximum performance by using as much RAM
> as you possibly can (it isn't like you are saving it for something)
> and so delay the flip as long as you can without spilling out to the
> disk.  When the flip eventually does occur, the vast majority of the
> data in RAM is unused and therefore untraced and the flip is nearly
> instantaneous.

I'd agree with all that.  But if some GC can be done on data in the
cache before it leaves, then that is beneficial.  (I will admit I've
never met a GC that does this; I've just read somewhere that it's a
good strategy.)

> > (The reason I doubted it was the GC is that in lisps very few types of
> > object can contain free pointers, so GC is theoretically easier than in
> > many other languages.)
>
> Huh?  Nearly any object created with defstruct or defclass has pointers.

I don't know about defclass, but defstruct is a macro that uses cons
cells and symbols internally isn't it?

What I meant was that the basic data types can be put into two
classes: those that contain pointers to other lisp objects and those
that don't.
From: Joe Marshall
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <slwg3j3m.fsf@alum.mit.edu>
"Rob Thorpe" <·············@antenova.com> writes:

> What I mean is that going through 100Mb of memory or so is quite a
> significant operation.  Caches these days are not that large compared
> to main memory size, I think my Athlon has a 256kB L2 data cache and
> 128kB L1 data cache.  If the GC goes through a similarly large area of
> memory, it will thrash the cache significantly while doing it, and
> afterward much of the cache's contents will be useless.

I have wondered about this.  Making the youngest generation small
enough to fit in the cache would also increase the frequency of flips
on that generation.  256K is pretty small compared to main memory
these days, so objects would not have much time to `age' and more
would be retained.  I'm not sure if the extra speed of the cache would be
enough to make up for the extra retention.  It'd be interesting to
experiment with this.
From: ······@earthlink.net
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126275157.261404.127850@g14g2000cwa.googlegroups.com>
Rob Thorpe wrote:
> I don't know about defclass, but defstruct is a macro that uses cons
> cells and symbols internally isn't it?

Huh?  You can ask defstruct to use cons cells, but the default is
something more like a C struct.  I don't remember a "use symbols"
option, and I don't know how that would be a useful way to store data,
including pointers to other lisp things.
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126293676.578595.112640@g43g2000cwa.googlegroups.com>
······@earthlink.net wrote:
> Rob Thorpe wrote:
> > I don't know about defclass, but defstruct is a macro that uses cons
> > cells and symbols internally isn't it?
>
> Huh?  You can ask defstruct to use cons cells, but the default is
> something more like a C struct.  I don't remember a "use symbols"
> option, and I don't know how that would be a useful way to store data,
> including pointers to other lisp things.

You're right, it doesn't use symbols.  What I meant was that it uses
normal lisp data types.
From: Joerg Hoehle
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <uu0gqekvg.fsf@users.sourceforge.net>
"Rob Thorpe" <·············@antenova.com> writes:
> Joe Marshall wrote:
> > Huh?  Nearly any object created with defstruct or defclass has pointers.

> I don't know about defclass, but defstruct is a macro that uses cons
> cells and symbols internally isn't it?

Huh? You've been reading a paper about some 1970s Lisp or what? You
know, those Lisps that only knew the CONS, FIXNUM and SYMBOL types, and
where (CDR <some-symbol>) would get you its property list, which would
of course also hold the symbol's print name. :)

Try DISASSEMBLE in a Common Lisp with a native code compiler.

Did you ever use DEFSTRUCT without its TYPE option?

Actually, you're completely right: any macro uses cons cells and
symbols internally for the simple reason that it must expand to
another Lisp expression. It is common knowledge that these are mostly
made of cons cells and symbols, backquote (for the non-purists) and a
few minor data types named atoms but not otherwise worth mentioning. :-)

Regards,
	Jorg Hohle
Telekom/T-Systems Technology Center
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126521023.955079.250850@z14g2000cwz.googlegroups.com>
Joerg Hoehle wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
> > Joe Marshall wrote:
> > > Huh?  Nearly any object created with defstruct or defclass has pointers.
>
> > I don't know about defclass, but defstruct is a macro that uses cons
> > cells and symbols internally isn't it?
>
> Huh? You've been reading a paper about some 1970s Lisp or what? You
> know, those Lisps that only knew the CONS, FIXNUM and SYMBOL types, and
> where (CDR <some-symbol>) would get you its property list, which would
> of course also hold the symbol's print name. :)

No, too much Emacs Lisp.

> Try DISASSEMBLE in a Common Lisp with a native code compiler.
>
> Did you ever use DEFSTRUCT without its TYPE option?
>
> Actually, you're completely right: any macro uses cons cells and
> symbols internally for the simple reason that it must expand to
> another Lisp expression. It is common knowledge that these are mostly
> made of cons cells and symbols, backquote (for the non-purists) and a
> few minor data types named atoms but not otherwise worth mentioning. :-)

:) Never heard of them - do they do anything important?

(See my previous reply - I meant variable not symbol)
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126562759.858850.276800@g44g2000cwa.googlegroups.com>
Rob Thorpe wrote:
> Joe Marshall wrote:
> > "Rob Thorpe" <·············@antenova.com> writes:
> > > Jon Harrop wrote:
> > > (The reason I doubted it was the GC is that in lisps very few types of
> > > object can contain free pointers, so GC is theoretically easier than in
> > > many other languages.)
> >
> > Huh?  Nearly any object created with defstruct or defclass has pointers.
>
> I don't know about defclass, but defstruct is a macro that uses cons
> cells and symbols internally isn't it?
>
> What I meant was that the basic data types can be put into two
> classes: those that contain pointers to other lisp objects and those
> that don't.

What I meant here is that, if the implementor decides to do it, then
in lisp a clear separation can be made between objects that contain
pointers and those that don't.

Cons cells, symbol tables, atoms, arrays, vectors and structs can
contain pointers.  Numbers, symbols, characters and strings can't.
Loads of lisp internals will fall into either category.

So, if the implementor wants to, two heaps can be used for these
types.  More could be used if that's useful.  The heap containing
pointers must be marked; the marks will enter the non-pointer heap but
can't go any further.  Both must be swept, but different strategies can
be used to repack the different heaps.  There's no chance of a pointer
waiting to be freed in the middle of a string, for example.
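
The allocation side of that idea, as a small C sketch (names invented):

#include <stddef.h>

typedef struct { char *ptr, *limit; } heap;

extern heap boxed_heap;    /* conses, vectors, structs: marked AND scanned */
extern heap leaf_heap;     /* string data, raw numbers: marked but never
                              scanned, so it can be repacked freely */

void *bump(heap *h, size_t size);   /* bump-allocate, collect when full */

void *allocate(size_t size, int may_contain_pointers)
{
    return bump(may_contain_pointers ? &boxed_heap : &leaf_heap, size);
}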

I think all this is true of Common Lisp; it certainly works for some
other lisps and some other languages.
From: rydis (Martin Rydstr|m) @CD.Chalmers.SE
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <w4c7jdl9r97.fsf@boris.cd.chalmers.se>
"Rob Thorpe" <·············@antenova.com> writes:
> Cons cells, symbol tables, atoms, arrays, vectors and structs can
> contain pointers.  Numbers, symbols, characters and strings can't.
> Loads of lisp internals will fall into either category.

Numbers (for instance complexes, floats or bignums) can, and probably
will, contain pointers.

A symbol will probably contain several pointers.

Characters may or may not contain pointers, depending on the
implementation.

Strings most certainly can contain pointers.

'mr

-- 
[Emacs] is written in Lisp, which is the only computer language that is
beautiful.  -- Neal Stephenson, _In the Beginning was the Command Line_
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126611203.738263.32500@g43g2000cwa.googlegroups.com>
Martin Rydstr|m wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
> > Cons cells, symbol tables, atoms, arrays, vectors and structs can
> > contain pointers.  Numbers, symbols, characters and strings can't.
> > Loads of lisp internals will fall into either category.
>
> Numbers (for instance complexes, floats or bignums) can, and probably
> will, contain pointers.

The others could, and bignums probably will, contain pointers.  These
are part of their implementation though, so it can be arranged that
they're kept separate.

> A symbol will probably contain several pointers.

Those would be the pointers associated with the symbol table.  I meant
that the string part of the symbol contains no pointers itself.

> Characters may or may not contain pointers, depending on the
> implementation.
>
> Strings most certainly can contain pointers.

They have fill pointers, but are you sure they contain any other sorts
of pointers?
From: Brian Downing
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <vkFVe.347238$xm3.170035@attbi_s21>
In article <·······················@g43g2000cwa.googlegroups.com>,
Rob Thorpe <·············@antenova.com> wrote:
> They have fill pointers, but are you sure they contain any other sorts
> of pointers?

I don't believe it says anywhere that CHARACTERs have to be immediate
representations.  Indeed, no guarantees are made of the result of 
(eq #\a #\a).  So characters can be pointers.

Since STRINGs contain CHARACTERs, then strings must be allowed to
contain pointers.

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Rob Thorpe
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <1126647955.071028.61670@f14g2000cwb.googlegroups.com>
Brian Downing wrote:
> In article <·······················@g43g2000cwa.googlegroups.com>,
> Rob Thorpe <·············@antenova.com> wrote:
> > They have fill pointers, but are you sure they contain any other sorts
> > of pointers?
>
> I don't believe it says anywhere that CHARACTERs have to be immediate
> representations.  Indeed, no guarantees are made of the result of
> (eq #\a #\a).  So characters can be pointers.

Yes.

> Since STRINGs contain CHARACTERs, then strings must be allowed to
> contain pointers.

My original point was about what the implementor could do with the
implementation.  The things in the list I (recklessly) gave as "not
containing pointers" are more accurately things that need not contain
pointers to other lisp objects - if the implementor chooses to do
things this way.
From: Julian Stecklina
Subject: Re: GC in Jon's raytracing benchmark
Date: 
Message-ID: <86acis60sf.fsf@dellbeast.localnet>
Jon Harrop <······@jdh30.plus.com> writes:

> I'm not sure that is the whole picture though. In comparison, I don't
> believe ocamlopt will optimise the temporary allocation away and (with no
> inlining: "-inline 0") ocamlopt is still much faster than SBCL, taking
> 21.7s instead of 17.4s with inlining (-inline 100). This indicates to me
> that inlining is one possible solution but a better solution is to have a
> GC that is much faster at allocating and immediately collecting
> temporaries. So my guess is that SBCL's GC needs some work.

I always assumed that a generational copying garbage collector is
especially efficient with lots of very short-lived objects compared to
other GC schemes, as allocation is just some pointer bumping and the
time spent scavenging the first generation is proportional to the
number of live objects.

Can anyone explain why SBCL seems to perform sub-optimally in this
scenario? Or does it not? I am aware of cache issues, but I always
thought that if the first generation matches the size of, say, the
second-level cache, they were negligible.

(On a side note: the SBCL-internal page on GENCGC is quite empty. :-/)

Regards,
-- 
Julian Stecklina

(Of course SML does have its weaknesses, but by comparison, a
 discussion of C++'s strengths and flaws always sounds like an
 argument about whether one should face north or east when one
 is sacrificing one's goat to the rain god.)         -- Thant Tessman