From: Philip Haddad
Subject: Garbage Collection and Games
Date: 
Message-ID: <1103321565.514026.210890@z14g2000cwz.googlegroups.com>
Hi all,
I have recently been doing some research into Garbage Collectors, and
playing around with CMUCL's GC. Since there has been a lot of
discussion about Lisp and games in here over the last few months, I
figured I'd continue the trend :)
I read the Naughty Dog postmortem on Jak and Daxter, about some of the
pros and cons of using Lisp for games (well, GOAL anyway), and I
wondered:
a) What type of GC would be the best to use in games? mark-sweep,
generational stop-and-copy, incremental tracing, etc. Would a real-time
GC work?
b) Java is used to write games frequently, and it works ok as far as I
know (coming from limited Java experience here). Java of course has a
GC. How is Java's GC different from most Lisp GCs? Is it really that
different?
c) Is it possible to use a non-real-time GC and only GC when the player
is at a menu screen or some such? I mean when you write games in C++
you don't have to worry about GCing (then again you have to declare
everything and clean up your mess), so is it possible not to worry
about GCing frequently in games? Or will you need to because of screen
refreshes etc.?
Just some thoughts and questions. I will probably be compiling all of
these ideas into some sort of essay when I get some of the answers, and
learn a little more about GC algorithms and the like.
-- 
Certum quod factum.
Philip Haddad

From: Vorax
Subject: Re: Garbage Collection and Games
Date: 
Message-ID: <1104115640.764667.244630@z14g2000cwz.googlegroups.com>
Philip Haddad wrote:
> a) What type of GC would be the best to use in games? mark-sweep,
> generational stop-and-copy, incremental tracing, etc. Would a
> real-time GC work?
> b) Java is used to write games frequently, and it works ok as far as I
> know (coming from limited Java experience here). Java of course has a
> GC. How is Java's GC different from most Lisp GCs? Is it really that
> different?
> c) Is it possible to use a non-real-time GC and only GC when the
> player is at a menu screen or some such? I mean when you write games
> in C++ you don't have to worry about GCing (then again you have to
> declare everything and clean up your mess), so is it possible not to
> worry about GCing frequently in games? Or will you need to because of
> screen refreshes etc.?


I am currently writing an OpenGL game engine in Java.  I have found
that you can avoid memory allocation in the rendering loop entirely if
you try.  I used to believe that this would be a problem as well.

The best trick I can suggest (and this would work just as well in
Lisp) is creating a pool of temporary member variables within your
objects.  Upon construction, allocate the objects, then use them when
you need temporary variables, or need to send a result back to a
caller but don't want to mess up your internal state.  The next thing
to remember is that the callers should be doing the same thing.

Example:

class Caller {
    private Matrix4 myMatrix = new Matrix4();

    ...

    // Copy the stack's scratch result into our own preallocated matrix.
    myMatrix.set( matStack.getMVMatrix(myMatrix) );
    // do something
    ...
}

class MatrixStack {
    private Matrix4 scratch = new Matrix4();
    private Matrix4 internallyUsedMatrix = new Matrix4();

    public Matrix4 getMVMatrix(Matrix4 m) {
        // math stuff here.
        ...

        // Reuse the preallocated scratch matrix rather than
        // allocating a new result object.
        scratch.set( internallyUsedMatrix );

        return scratch;
    }
}


If it's safe, go ahead and use the returned variable directly to avoid
having to copy the result's contents into your local variable.  Try to
return safe values where you can, because this avoids any data copying
(the myMatrix.set( ... ) code above).  In most cases you don't need to
copy anything anyway, because the data only lives for one rendering
frame within the loop, and your loop is a very controlled environment
(by you).
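A self-contained, compilable sketch of this pooling pattern (the class
and method bodies here are illustrative fillers, not from an actual
engine): every Matrix4 is allocated once at construction time, so the
per-frame calls perform no allocation at all.

```java
// Hypothetical sketch of the "preallocated scratch object" pattern.
// A minimal row-major 4x4 matrix; only what the sketch needs.
final class Matrix4 {
    final float[] m = new float[16];
    Matrix4() { identity(); }
    void identity() {
        for (int i = 0; i < 16; i++) m[i] = (i % 5 == 0) ? 1f : 0f;
    }
    void set(Matrix4 src) { System.arraycopy(src.m, 0, this.m, 0, 16); }
    // this = a * b; assumes `this` aliases neither a nor b.
    void mul(Matrix4 a, Matrix4 b) {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++) {
                float s = 0f;
                for (int k = 0; k < 4; k++)
                    s += a.m[r * 4 + k] * b.m[k * 4 + c];
                m[r * 4 + c] = s;
            }
    }
}

final class MatrixStack {
    // Both matrices allocated once, up front.
    private final Matrix4 scratch = new Matrix4();
    private final Matrix4 modelView = new Matrix4();

    // No `new` anywhere on this path: the same scratch object is
    // returned every call, and callers copy out of it if they need to.
    Matrix4 getMVMatrix(Matrix4 in) {
        scratch.mul(modelView, in);
        return scratch;
    }
}
```

Callers that hold the result across frames copy it into their own
preallocated matrix, as in the Caller example above.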

My engine currently (though still in development) can maintain 63 fps
(configured to that for synchronization with monitor refresh) with 70K+
tris, 12 animated MD3 characters, and 3 active particle systems of 6k
particles, without a flicker in the FPS count (monitored for 12 hours).
...If Java can do it, Lisp can too.  The engine is doing all matrix
calcs on behalf of OpenGL as well (so I can track VW coords while
using LC coords for my scene graph), and that would be the biggest risk
for GC (lots of passing around and potential for creating new objects).

From: Steven M. Haflich
Subject: Re: Garbage Collection and Games
Date: 
Message-ID: <6HNzd.2979$5R.2746@newssvr21.news.prodigy.com>
There seems to be a flawed assumption in this thread that normal
Lisp execution must suspend during gc.  While this is true of
essentially all the mainline CL implementations today, there is
no fundamental reason why this must be so.  There have been in
the past gc implementations that worked in parallel without
suspending Lisp execution (or at least with the worst-possible
suspension being acceptably short), and there is no reason why
modern designs and implementations would not be possible today.

I've been sitting on such a design for about a decade, and there
are other known designs that are at least as good as the one in
my head.  Here are the issues:

   Designing and implementing a robust GC in a mature and complete
CL implementation is a big job.  The code generator needs to know
a lot about the gc and memory allocator, and vice versa.  All the
changes to make these things work with one another cause a lot
of destabilization.  Square that with the existence of
multithreading.  Oh, and did I mention multithreading?  (If you
don't get the joke, never mind.)  When and how does it become
worthwhile for a stable implementation to suffer the instability
of such a fundamental change?  And how do you fund it?

   Code generated by the compiler, and higher-level code (for
instance, the implementation of hash tables), may in certain
critical sections depend upon atomicity of execution for
certain segments of code.  In a multiprocessor environment these
assumptions may not hold automatically, and enforcing them may
require interlocks or suboptimal code sequences, and these
changes reduce the speed of _all_ code execution.  How much would
the possibility of parallel gc (and perhaps also parallel
multiprocessor execution) be worth?  Would it be worth a 20%
slowdown?  Would it be worth a 50% slowdown?  These are the
kinds of estimates postulated by experts who have implemented
existing compilers and gc systems.
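As a concrete (hypothetical) illustration of that cost: a
single-threaded bump allocator is a plain add and compare, but once
mutator threads and a collector share the heap, the bump must become
an interlocked operation, and that price is paid on every allocation,
whether or not a gc is currently running.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch, not any particular implementation: a heap
// allocation pointer that must be advanced atomically because several
// threads (mutators, or a parallel gc) may allocate concurrently.
final class BumpAllocator {
    private final AtomicLong free = new AtomicLong(0);
    private final long limit;

    BumpAllocator(long heapBytes) { limit = heapBytes; }

    // Returns the byte offset of the new object, or -1 if the heap
    // region is exhausted.  getAndAdd is the interlock: in a
    // single-threaded system this would be an ordinary add.
    long allocate(long bytes) {
        long addr = free.getAndAdd(bytes);
        return (addr + bytes <= limit) ? addr : -1;
    }
}
```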

Something to think about.
From: Christopher C. Stacy
Subject: Re: Garbage Collection and Games
Date: 
Message-ID: <uoeggrte0.fsf@news.dtpq.com>
If you want to really hair things up, maybe a multiprocessor
system could actually be used to run the GC in parallel with 
the mutator threads; I wonder if this would be fast, or worse?
From: Duane Rettig
Subject: Re: Garbage Collection and Games
Date: 
Message-ID: <4652nvfbd.fsf@franz.com>
······@news.dtpq.com (Christopher C. Stacy) writes:

> If you want to really hair things up, maybe a multiprocessor
> system could actually be used to run the GC in parallel with 
> the mutator threads; I wonder if this would be fast, or worse?

Yes :-)

It would first be a requirement - if a gc is not explicitly designed
to run in a thread while other threads are running (on other
processors, of course), then each thread would have to be stopped
while some segment of the gc runs, and it wouldn't be what I consider
to be a true multiprocessor-ready gc.  If, however, other threads are
allowed to run during gc operations, then it is only a question of
configuration whether the gc runs in its own thread (preferred)
or in one of the threads that was already running Lisp code.  Either
way, it is both:

 - faster:  multiple threads can continue to run without stoppage
from gc operations

 - slower: every access to the lisp heap must take a hardware or
software trap, or an extra indirection, to ensure that the access
is properly forwarded.

This means that individual threads would in general be slower,
but a system of highly parallelized operations would be faster.
So depending on the nature of the application, the overall result
might be faster or slower, to the extent that the application is
itself parallelizable to fit into the number of processors.
However, Amdahl's Law suggests that most applications would be
slower...

As for the tradeoffs on the slower side: the extra indirection
is the most straightforward solution; it takes very little extra
logic on the Lisp code side, but the cost is one indirection for
every access.  This means that non-parallel applications are halved
in speed for access-heavy operations.  The software trap situation
is less constricting to Lisp code, but it requires specific memory
layouts to ensure that the "trap" tests are fast.  I believe that
Steve's solution is one of these.  The hardware trap situation is
by far the least intrusive to normal operation _when_ a page is
not being forwarded, but it has an extremely high cost (on the
order of a hundred or more cycles) when forwarding is required.
If an operating system allows the kind of "fast user trap" that
would be necessary to get in and out with automatic forwarding
built in (say, as a level-one trap), the cycle count could be
pushed to something less than a hundred cycles.
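A toy version of the extra-indirection approach (illustrative Java,
not any real collector's mechanism): the mutator never holds a direct
reference, only a handle with a forwarding slot, so a concurrent
collector can move the object by updating one volatile field - at the
cost of one indirection on every access.

```java
// Hypothetical sketch of a forwarding handle.  Every heap access pays
// one extra load through `target`; in exchange, the collector can
// relocate the object while mutator threads keep running.
final class Handle<T> {
    // volatile so a collector thread's forwarding write is visible
    // to mutator threads without further locking.
    private volatile T target;

    Handle(T initial) { target = initial; }

    // Mutator side: the indirection paid on every access.
    T get() { return target; }

    // Collector side: point the handle at the relocated copy.
    void forward(T copy) { target = copy; }
}
```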

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Steven M. Haflich
Subject: Re: Garbage Collection and Games
Date: 
Message-ID: <6a4Ad.3989$yV1.1855@newssvr14.news.prodigy.com>
Christopher C. Stacy wrote:
> If you want to really hair things up, maybe a multiprocessor
> system could actually be used to run the GC in parallel with 
> the mutator threads; I wonder if this would be fast, or worse?

Parallel gc is implicit in the design, but the combined execution
would still be slower for a single, non-parallelizable mutator
computation.  The argument is that a well-tuned application can
generally run with less than 10% overhead for gc, but the estimates
for parallel gc overhead are generally larger than 10%.

As a limiting case with a single processor, parallel gc designs
degenerate into nonobtrusive gc, where the gc can run in small
processor quanta.  Such a design could still be useful, even
though the overall execution speed would be slowed, because gc-
induced latency could be made quite small and acceptable to
some real-time applications.  Also, the interlocking issues
with only a single processor are much less severe than with
multiple processors, so the speed hit is much less than with
even two processors (mutator and gc).  But I have no numbers
to offer.
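The degenerate single-processor case described above can be sketched
as a marker that traces at most a fixed budget of objects per quantum,
bounding gc-induced latency.  This is an illustrative toy, not a real
collector: among other things, it omits the write barrier an
incremental gc needs to stay correct while the mutator runs between
quanta.

```java
import java.util.ArrayDeque;

// Hypothetical incremental marking sketch: each call to step() is one
// small gc quantum, after which control returns to the mutator.
final class IncrementalMarker {
    static final class Node {
        boolean marked;
        final Node[] children;
        Node(Node... cs) { children = cs; }
    }

    // Gray set: marked but not yet traced.
    private final ArrayDeque<Node> gray = new ArrayDeque<>();

    IncrementalMarker(Node root) {
        root.marked = true;
        gray.push(root);
    }

    // Trace at most `budget` objects, then yield to the mutator.
    // Returns true once marking is complete.
    boolean step(int budget) {
        while (budget-- > 0 && !gray.isEmpty()) {
            Node n = gray.pop();
            for (Node c : n.children)
                if (!c.marked) { c.marked = true; gray.push(c); }
        }
        return gray.isEmpty();
    }
}
```

With a budget of one object per quantum, the pause per quantum stays
tiny while total marking work is spread across many mutator cycles.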