Debugging non-reclaimed memory

From: CoffeeMug
Subject: Debugging non-reclaimed memory
Date: Thu, 12 Jun 2008 03:46:51 +0000
Message-ID: <9c74e4b0-2a68-4950-96af-d589c019a6cd@e53g2000hsa.googlegroups.com>

Hello,

I have a program that uses increasingly more memory over time and
eventually crashes because of heap exhaustion. The program performs
many independent steps which, when finished, should leave no
references to any of the allocated objects (this is a web server).
There are a few global structures that are populated only at the startup
of the program and they do not hold any references to memory created
in these steps. I am at a loss as to which part of the code allocates
the memory that cannot be reclaimed, and which long-living structures
reference this memory.

SBCL has a number of functions that appear to be useful in tracking
down the culprit, in particular sb-vm:list-allocated-objects and
sb-vm::list-referencing-objects. I wanted to try two different strategies
with these functions - creating "checkpoints" by doing a full gc and
calling list-allocated-objects, and then diffing the results to see
which objects allocated between the checkpoints could not be
deallocated; and building a graph of object references, collapsing
cycles into single vertices, and doing a topological sort on the
resulting DAG to figure out which objects hold most of the memory.
Unfortunately, the two functions above are rather fragile: equality
primitives don't work on all the objects returned and result in
strange errors, memory faults occasionally occur, etc. I was unable
to get around these limitations to implement either of the above
strategies.

I am not sure how to proceed at this point. Are there alternative (but
more stable) functions on other implementations that I could use? Are
there other tools to debug this sort of thing? How would you go about
solving this problem?

Regards,
- Slava Akhmechet

Re: Debugging non-reclaimed memory Pascal J. Bourguignon
- Re: Debugging non-reclaimed memory CoffeeMug
  - Re: Debugging non-reclaimed memory George Neuner
Re: Debugging non-reclaimed memory Rainer Joswig
Re: Debugging non-reclaimed memory Lars Rune Nøstdal
- Re: Debugging non-reclaimed memory George Neuner

From: Pascal J. Bourguignon
Subject: Re: Debugging non-reclaimed memory
Date: Thu, 12 Jun 2008 05:58:56 +0000
Message-ID: <87abhr8a8v.fsf@hubble.informatimago.com>

Keeping these lists in the image would prevent the garbage collector
from collecting these items...  You'd have to dump them to files, and diff
the files.
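
A coarse way to get checkpoints onto disk can be sketched as follows. Instead of the object lists themselves, it writes the per-type report printed by the standard ROOM function to a file after a full GC, so two checkpoints can be compared with an external diff. The helper name and the file paths are hypothetical, the ROOM output format is implementation-specific, and the full GC call assumes SBCL.

```lisp
;; Sketch only: WRITE-HEAP-CHECKPOINT is a hypothetical helper, not an SBCL API.
(defun write-heap-checkpoint (path)
  "Run a full GC (on SBCL), then write the ROOM report to PATH. Returns PATH."
  #+sbcl (sb-ext:gc :full t)
  (with-open-file (*standard-output* path
                                     :direction :output
                                     :if-exists :supersede)
    ;; ROOM prints to *STANDARD-OUTPUT*; T asks for the most detailed report.
    (room t))
  path)
```

Calling it before and after a batch of requests (say with "/tmp/heap-1.txt" and "/tmp/heap-2.txt") and diffing the two files shows which kinds of objects grew, without keeping any object list alive in the image.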

Did you do a deep garbage collection? AFAIK, SBCL's garbage collector
only collects garbage from the youngest generation. (But I guess it
would automatically do a deep garbage collection on memory
exhaustion.)
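
To see whether the growth survives a deep collection, one could compare heap usage after a default collection with usage after a full one. This is a minimal sketch assuming SBCL: sb-kernel:dynamic-usage is assumed to report the bytes currently in use in the dynamic space, and the function name is made up.

```lisp
;; Sketch only: USAGE-AFTER-GC is a hypothetical helper name.
(defun usage-after-gc ()
  "Return two values: bytes in use after a default GC and after a full GC."
  (sb-ext:gc)                       ; default collection
  (let ((after-default (sb-kernel:dynamic-usage)))
    (sb-ext:gc :full t)             ; collect every generation
    (values after-default (sb-kernel:dynamic-usage))))
```

If the second value keeps climbing as the server runs, the memory is really being retained rather than just waiting in an older generation.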

Also, note that on X86, it is conservative:
http://sbcl-internals.cliki.net/GENCGC 
This could explain some leaking.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Indentation! -- I will show you how to indent when I indent your skull!"

From: CoffeeMug
Subject: Re: Debugging non-reclaimed memory
Date: Thu, 12 Jun 2008 17:13:30 +0000
Message-ID: <c683226f-dac5-4cb7-892e-afdbf95fa1be@m73g2000hsh.googlegroups.com>

On Jun 12, 1:58 am, ····@informatimago.com (Pascal J. Bourguignon)
wrote:
> Keeping these lists in the image would prevent the garbage collector
> from collecting these items...  You'd have to dump them to files, and diff
> the files.
I don't think this is a problem. I create a checkpoint by doing a full
gc, then grabbing a list of objects. I then run some more code, and do
another checkpoint in the same manner. If I do a diff now, it should
give me objects allocated between checkpoint one and checkpoint two
that could not be reclaimed (note that I do a gc before I generate the
list, so presumably objects in the list are already referenced by
someone else).

A bigger problem is allocating things on the heap while walking the
heap (calling list-allocated-objects does exactly that). I think this
is why these functions are so unstable.

> Did you do a deep garbage collection?
Yes.

> Also, note that on X86, it is conservative: http://sbcl-internals.cliki.net/GENCGC
I think this could explain minor "accidental" leaking, but not the
consistent large leaks I am seeing. I've also verified this behavior
on other implementations.

From: George Neuner
Subject: Re: Debugging non-reclaimed memory
Date: Fri, 13 Jun 2008 05:45:22 +0000
Message-ID: <jm0454lmq9lsbsvi44ua5m73mpo6d58l80@4ax.com>

On Thu, 12 Jun 2008 10:13:30 -0700 (PDT), CoffeeMug
<·········@gmail.com> wrote:

>On Jun 12, 1:58 am, ····@informatimago.com (Pascal J. Bourguignon)
>wrote:
>> Keeping these lists in the image would prevent the garbage collector
>> from collecting these items...  You'd have to dump them to files, and diff
>> the files.
>
>I don't think this is a problem. I create a checkpoint by doing a full
>gc, then grabbing a list of objects. I then run some more code, and do
>another checkpoint in the same manner. If I do a diff now, it should
>give me objects allocated between checkpoint one and checkpoint two
>that could not be reclaimed (note that I do a gc before I generate the
>list, so presumably objects in the list are already referenced by
>someone else).

Yes, but those references might not still exist at the second GC, in
which case being in the list will prevent the objects from being collected.

You really need to somehow dump the objects to a file, free the list
and do another GC to start the next allocation round fresh.

I'm not sure how to get an accurate picture with a moving collector
unless you somehow tag the objects with a unique ID.  With a
non-moving collector you can just dump the object addresses.
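
The unique-ID idea can be sketched with a key-weak hash table, so that the tags themselves do not keep anything alive. This assumes SBCL's :weakness extension to make-hash-table. All names are hypothetical, only objects explicitly passed to tag-object are tracked, and the code is written for a single thread.

```lisp
;; Sketch only; all names below are hypothetical. Not thread-safe as written.
(defvar *object-ids* (make-hash-table :test 'eq :weakness :key)
  "Maps tracked objects to integer IDs without keeping the objects alive.")

(defvar *next-id* 0)

(defun tag-object (object)
  "Give OBJECT an ID that stays the same even if the collector moves it."
  (or (gethash object *object-ids*)
      (setf (gethash object *object-ids*) (incf *next-id*))))

(defun dump-live-ids (path)
  "After a full GC, write the IDs of tagged objects still alive to PATH.
Returns the number of IDs written."
  (sb-ext:gc :full t)
  (let ((ids (loop for id being the hash-values of *object-ids*
                   collect id)))
    (with-open-file (out path :direction :output :if-exists :supersede)
      (format out "~{~D~%~}" (sort ids #'<)))
    (length ids)))
```

Two such files written at different checkpoints can be diffed outside the image; IDs that appear in both belong to objects that survived the interval.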

>A bigger problem is allocating things on the heap while walking the
>heap (calling list-allocated-objects does exactly that). I think this
>is why these functions are so unstable.

I don't use SBCL so this might be an unworkable suggestion, but if
there is a way to tie into and observe the GC heap walk you could dump
live objects one at a time as they are encountered.

>> Did you do a deep garbage collection?
>Yes.
>
>> Also, note that on X86, it is conservative: http://sbcl-internals.cliki.net/GENCGC
>I think this could explain minor "accidental" leaking, but not the
>consistent large leaks I am seeing. I've also verified this behavior
>on other implementations.

George
--
for email reply remove "/" from address

From: Rainer Joswig
Subject: Re: Debugging non-reclaimed memory
Date: Fri, 13 Jun 2008 07:31:56 +0000
Message-ID: <joswig-0B99EE.09315613062008@news-europe.giganews.com>

In article 
<····································@e53g2000hsa.googlegroups.com>,
 CoffeeMug <·········@gmail.com> wrote:

> Hello,
> 
> I have a program that uses increasingly more memory over time and
> eventually crashes because of heap exhaustion. The program performs
> many independent steps which, when finished, should leave no
> references to any of the allocated objects (this is a web server).

Nothing about being a web server guarantees that. A 'tuned'
web server has caches that are filled and used at runtime.
Then there are continuation-based applications that
may store lots of stuff in the continuation.

> There are a few global structures that are populated only at the startup
> of the program and they do not hold any references to memory created
> in these steps. I am at a loss as to which part of the code allocates
> the memory that cannot be reclaimed, and which long-living structures
> reference this memory.
> 
> SBCL has a number of functions that appear to be useful in tracking
> down the culprit, in particular sb-vm:list-allocated-objects and
> sb-vm::list-referencing-objects. I wanted to try two different strategies
> with these functions - creating "checkpoints" by doing a full gc and
> calling list-allocated-objects, and then diffing the results to see
> which objects allocated between the checkpoints could not be
> deallocated; and building a graph of object references, collapsing
> cycles into single vertices, and doing a topological sort on the
> resulting DAG to figure out which objects hold most of the memory.
> Unfortunately, the two functions above are rather fragile: equality
> primitives don't work on all the objects returned and result in
> strange errors, memory faults occasionally occur, etc. I was unable
> to get around these limitations to implement either of the above
> strategies.
> 
> I am not sure how to proceed at this point. Are there alternative (but
> more stable) functions on other implementations that I could use? Are
> there other tools to debug this sort of thing? How would you go about
> solving this problem?


I'm not a user of your software combination, but here are some general
remarks. I'm not sure whether they are helpful.

There are many places where the problem may occur and one
would try to find the 'area' which might be responsible:

* is it cross platform (Lisp implementation) or not?
* does it happen with another web server?
* is it a problem of a particular version of the Lisp system,
  the web server or some libraries?
* the FFI interface to do the network calls might leak memory
  -> try to replay the requests without network code and
  see if it still happens
* the (conservative) GC might not reclaim all memory, or
  might not reclaim certain kinds of memory. I'm not sure
  I would want to use a conservative GC for a long-running server
  app without a wizard around who is able to check for problems
* caches of the web server (for users, sessions, connections, buffers,
  pages, html snippets, header lines, ...)
* input/output buffers
* in-memory logs of the web server
* lisp system data structures (symbol table, ...)
* OS resources, the web server could use some OS resources
  (threads, connections, file handles, ...) without ever freeing them
* continuation-based servers can potentially have a problem. It depends
  on what the continuation keeps between requests and
  how long the data is kept. There could also be a leak
  where data from the previous request (say, assembled pages)
  is still around
* it could be in application code. Run it without network code,
  then without web server code invoked and see if the
  application code is responsible.
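
The bisection suggested in the list (replay the requests without the network code, then without the web server code) needs some way to measure growth one layer at a time. A rough sketch, assuming SBCL: the helper name is hypothetical, THUNK stands for a zero-argument function that performs one request at the layer under test, and sb-kernel:dynamic-usage is assumed to return the bytes in use in the dynamic space.

```lisp
;; Sketch only: RETAINED-BYTES-PER-CALL is a hypothetical helper.
(defun retained-bytes-per-call (thunk &key (calls 1000))
  "Call THUNK CALLS times and return the average number of bytes
still live per call after a full GC."
  (sb-ext:gc :full t)
  (let ((before (sb-kernel:dynamic-usage)))
    (dotimes (i calls)
      (funcall thunk))
    (sb-ext:gc :full t)
    (float (/ (- (sb-kernel:dynamic-usage) before) calls))))
```

A result that stays well above zero as :calls is increased points at the layer being exercised; a result near zero suggests looking one layer further out.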


Watch out for global data structures that are filled with data
(resources, logs, ...).
Closures/continuations also could keep lots of data 'alive'.
One closure keeps the next closure and so on.
I would also check if runtime compilation happens and if that's
a problem.
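
How a stored closure keeps data alive can be seen with a weak pointer. In this sketch (SBCL assumed; *continuations*, make-page-continuation and page-still-alive-p are made-up names) a closure kept in a global list captures a large string, so the string survives a full GC even though nothing else refers to it.

```lisp
;; Sketch only; all names are hypothetical.
(defvar *continuations* '()
  "Stands in for whatever a server keeps between requests.")

(defun make-page-continuation (page)
  "Return a closure that captures PAGE."
  (lambda () (length page)))

(defun page-still-alive-p ()
  "Store a closure over a fresh page, run a full GC, and report
whether the page survived."
  (let ((weak (let ((page (make-string 100000 :initial-element #\x)))
                (push (make-page-continuation page) *continuations*)
                (sb-ext:make-weak-pointer page))))
    (sb-ext:gc :full t)
    ;; The only strong path to the page is *CONTINUATIONS* -> closure -> page.
    (not (null (sb-ext:weak-pointer-value weak)))))
```

Clearing *continuations* and collecting again would normally let the page go, although a conservative collector may hold on to it a little longer.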




-- 
http://lispm.dyndns.org/

From: Lars Rune Nøstdal
Subject: Re: Debugging non-reclaimed memory
Date: Sun, 15 Jun 2008 08:24:05 +0000
Message-ID: <4854d1a3$0$2330$c83e3ef6@nn1-read.tele2.net>

What about adding finalization hooks to some of the data you expect to be
removed after a (sb-ext:gc :full t) and see before vs. after?

Store references to the objects in a weak hash (see the :WEAKNESS keyarg
for make-hash-table) and see what's left after a full GC.

..I have no idea if this'll work..
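
A sketch of the finalizer variant, assuming SBCL's sb-ext:finalize, which calls a function of no arguments after the object has been collected. The names are hypothetical. The closure records only a label, since a finalizer that captured the object itself would keep it alive forever, and the label must not refer to the object either.

```lisp
;; Sketch only; all names are hypothetical.
(defvar *collected-lock* (sb-thread:make-mutex :name "collected labels"))
(defvar *collected-labels* '()
  "Labels of objects whose finalizers have run.")

(defun note-when-collected (object label)
  "Arrange for LABEL to be recorded once OBJECT has been garbage collected."
  (sb-ext:finalize object
                   (lambda ()
                     ;; Finalizers may run in another thread, hence the lock.
                     (sb-thread:with-mutex (*collected-lock*)
                       (push label *collected-labels*))))
  object)

(defun collected-labels ()
  "Run two full GCs, then return a copy of the labels recorded so far."
  (sb-ext:gc :full t)
  (sb-ext:gc :full t)   ; a second pass, in case finalizable objects need one
  (sb-thread:with-mutex (*collected-lock*)
    (copy-list *collected-labels*)))
```

Labels that never show up after the work is done correspond to objects that are still reachable, or whose finalizers have not run yet.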

-- 
Lars Rune Nøstdal
http://nostdal.org/

From: George Neuner
Subject: Re: Debugging non-reclaimed memory
Date: Mon, 16 Jun 2008 05:16:40 +0000
Message-ID: <t7tb54h1foejla919ngu16j36sie8cjp18@4ax.com>

On Sun, 15 Jun 2008 10:24:05 +0200, Lars Rune Nøstdal
<···········@gmail.com> wrote:

>CoffeeMug wrote:
>> Hello,
>> 
>> I have a program that uses increasingly more memory over time and
>> eventually crashes because of heap exhaustion. 
>
>
>What about adding finalization hooks to some of the data you expect to be
>removed after a (sb-ext:gc :full t) and see before vs. after?
>
>Store references to the objects in a weak hash (see the :WEAKNESS keyarg
>for make-hash-table) and see what's left after a full GC.
>
>..I have no idea if this'll work..

I don't know about SBCL, but in many GC implementations, it takes 2
collections to recycle objects that have finalizers.

George
--
for email reply remove "/" from address