Data structure size can o' worms

From: Eric
Subject: Data structure size can o' worms
Date: Thu, 05 Jun 2003 20:24:47 +0000
Message-ID: <ac4f4371.0306051224.6d15f751@posting.google.com>

Howdy All,

We have an application which keeps a fairly complex amount of logging
information as it goes along. The problem is that over time this
causes our Lisp to bloat and die and I need to analyze this situation
so that we can make some hard decisions about how to fix the
situation.  This is one of the cases where our "customers" want to
have their cake and eat it to. So I have to be able to provide some
arguments about the costs of the different types of approaches.  Which
brings me to my seemingly simple question...

Given some variable *foo* how can I calculate the size of the data
structure rooted there?

Now before you tell me how nonsensical that notion even is I
understand that this is fraught with problems.  For instance if there
is a symbol there I could track it back to its package and then
basically hit every live object in the lisp image.  What I need is
something that will Do-What-I-Mean so that if I call (how-big '(a b
c)) I would get something like "3 symbols + 3 conses" or "96 whatsits"
the actual unit of measurement is not that important.  Some other
criteria...

* It should handle circular and multiple references correctly and not
count them again and again.

* It *need not* answer the question of what is reachable through other
paths in the image.  I don't need to know how much of it would be
collected as garbage if my initial link was broken.

* It can be really slow.

* It has to be CLOS friendly.

I am hoping that someone has a suggestion that will save me from
having to implement this thing on my own.  I think that there are most
certainly a set of nasty corner cases here which hopefully someone
else has already addressed.

Thanks a bunch!

-Eric

Re: Data structure size can o' worms Barry Margolin
Re: Data structure size can o' worms Steven M. Haflich
Re: Data structure size can o' worms Sam Steingold
Re: Data structure size can o' worms Joe Marshall
- Re: Data structure size can o' worms Duane Rettig
- Re: Data structure size can o' worms Barry Margolin
  - Re: Data structure size can o' worms Joe Marshall
    - Re: Data structure size can o' worms Barry Margolin
      - Re: Data structure size can o' worms Joe Marshall
        Re: Data structure size can o' worms Barry Margolin
        Re: Data structure size can o' worms Joe Marshall
        Re: Data structure size can o' worms Barry Margolin
        Re: Data structure size can o' worms Kent M Pitman
        Re: Data structure size can o' worms Joe Marshall
- Re: Data structure size can o' worms Rob Warnock
  - Re: Data structure size can o' worms Rob Warnock

From: Barry Margolin
Subject: Re: Data structure size can o' worms
Date: Thu, 05 Jun 2003 20:40:56 +0000
Message-ID: <sVNDa.41$9Q2.279@paloalto-snr1.gtei.net>

In article <····························@posting.google.com>,
Eric <····@twincitizen.net> wrote:
>Given some variable *foo* how can I calculate the size of the data
>structure rooted there?
>
>Now before you tell me how nonsensical that notion even is I
>understand that this is fraught with problems.  For instance if there
>is a symbol there I could track it back to its package and then
>basically hit every live object in the lisp image.  What I need is
>something that will Do-What-I-Mean so that if I call (how-big '(a b
>c)) I would get something like "3 symbols + 3 conses" or "96 whatsits"
>the actual unit of measurement is not that important.  Some other
>criteria...

You'll need to decide for each data type, which of its slots you want to
descend into to find more dependent objects.  E.g. for a cons you should
descend into the CAR and CDR, for a symbol it's probably sufficient for
your needs to descend only into the property list.

>* It should handle circular and multiple references correctly and not
>count them again and again.

Create an EQL hash table.  Perform a depth-first search starting from the
object you're given.  Check whether the object is in the hash table; if it
isn't, add it and recurse into it.

When you're done, count the objects in the hash table.

>* It *need not* answer the question of what is reachable through other
>paths in the image.  I don't need to know how much of it would be
>collected as garbage if my initial link was broken.
>
>* It can be really slow.
>
>* It has to be CLOS friendly.

You'll need to use the MOP to enumerate the slots and descend into them.
Also, it's likely to be difficult to determine which are the "interesting"
slots to descend into.  Classes that are specific to your application can
be special-cased so you do the right thing, but other classes will have to
be handled by a generic catch-all.

-- 
Barry Margolin, ··············@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Steven M. Haflich
Subject: Re: Data structure size can o' worms
Date: Sat, 07 Jun 2003 02:19:23 +0000
Message-ID: <3EE14C12.5030807@alum.mit.edu>

Eric wrote:

> Given some variable *foo* how can I calculate the size of the data
> structure rooted there?

The size of the data structure rooted there is always 42.

Everyone is about to point out that there is no good way to do this,
and everyone is essentially right.  You need to make too many decisions
how to factor shared data:  An instance of a class ought be credited
with some share of the class and superclass metadata, according to the
number of instances.  If your data structure is interns many symbols,
those symbols need to be counted as well as the package tables that
intern them.  Shared structure (even cons trees) needs be apportioned
among the distinct containing structures.  If your data is treelike
with no shared branches, the question may be tractible.  If there are
reentrant branches, any algorithmic answer may be misleading.

In some implementations, conses occupy different amounts of space
depending on whether they are in new or old space, or depending
whether they are cdr coded.  Ditto for strings in some implementations.

Another way to approach the problem is to ask the implementation for
the total occupied heap size (which cl:room may be able to tell you)
and then compare results after accumulating some log info.

From: Sam Steingold
Subject: Re: Data structure size can o' worms
Date: Sat, 07 Jun 2003 15:28:53 +0000
Message-ID: <m3smqlrjnv.fsf@loiso.podval.org>

> * In message <····························@posting.google.com>
> * On the subject of "Data structure size can o' worms"
> * Sent on 5 Jun 2003 13:24:47 -0700
> * Honorable ····@twincitizen.net (Eric) writes:
>
> Given some variable *foo* how can I calculate the size of the data
> structure rooted there?

everyone will repeat the same thing (which I agree with) - you cannot
get a good answer here.

CLISP offers a macro SPACE (renamed to EXT:TIMES - for TIME & Space - in
the CVS due to ANSI compliance issues) which provides detailed
information about the data allocation.

RELEASE:
<http://clisp.cons.org/impnotes/environment-dict.html#space>
CVS HEAD:
<ftp://cvs2.cons.org/pub/lisp/clisp/snapshots/impnotes/environment-dict.html#time>

-- 
Sam Steingold (http://www.podval.org/~sds) running RedHat9 GNU/Linux
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.palestine-central.com/links.html>
Do not tell me what to do and I will not tell you where to go.

From: Joe Marshall
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 16:45:34 +0000
Message-ID: <u1az6vyp.fsf@ccs.neu.edu>

····@twincitizen.net (Eric) writes:

> Given some variable *foo* how can I calculate the size of the data
> structure rooted there?
>
> * It *need not* answer the question of what is reachable through other
> paths in the image.  I don't need to know how much of it would be
> collected as garbage if my initial link was broken.

It seems to me that if you are interested in the `aggregate' amount of
storage held by some variable, then you *do* want to know how much of
it would be collected as garbage if the initial link is broken.

Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
and perform another full GC.  It seems to me that the storage
reclaimed is *exactly* the `size' of *foo*.

From: Duane Rettig
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 19:14:10 +0000
Message-ID: <4y90bukql.fsf@beta.franz.com>

Joe Marshall <···@ccs.neu.edu> writes:

> ····@twincitizen.net (Eric) writes:
> 
> > Given some variable *foo* how can I calculate the size of the data
> > structure rooted there?
> >
> > * It *need not* answer the question of what is reachable through other
> > paths in the image.  I don't need to know how much of it would be
> > collected as garbage if my initial link was broken.
> 
> It seems to me that if you are interested in the `aggregate' amount of
> storage held by some variable, then you *do* want to know how much of
> it would be collected as garbage if the initial link is broken.
> 
> Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
> and perform another full GC.  It seems to me that the storage
> reclaimed is *exactly* the `size' of *foo*.

But beware the Heisenberg Effect; you'd have to know the precise
relationship of your lisp's ephemeral heap space with its global
heap space (and the reporting accuracy of the lisp on each), and
in fact the consing characteristice of the test itself.  Most people
who run tests like this tend to forget at least once that their
interpreter probably conses, and thus a good test must be fully
run within compiled code.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182

From: Barry Margolin
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 17:20:53 +0000
Message-ID: <Vl3Fa.12$h8.151@paloalto-snr1.gtei.net>

In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>It seems to me that if you are interested in the `aggregate' amount of
>storage held by some variable, then you *do* want to know how much of
>it would be collected as garbage if the initial link is broken.
>
>Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
>and perform another full GC.  It seems to me that the storage
>reclaimed is *exactly* the `size' of *foo*.

(setq *foo* <something>)
(setq *bar* *foo*)

By your logic, the sizes of *foo* and *bar* are each zero.

-- 
Barry Margolin, ··············@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Joe Marshall
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 19:06:36 +0000
Message-ID: <isrf6pfn.fsf@ccs.neu.edu>

Barry Margolin <··············@level3.com> writes:

> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
> >It seems to me that if you are interested in the `aggregate' amount of
> >storage held by some variable, then you *do* want to know how much of
> >it would be collected as garbage if the initial link is broken.
> >
> >Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
> >and perform another full GC.  It seems to me that the storage
> >reclaimed is *exactly* the `size' of *foo*.
> 
> (setq *foo* <something>)
> (setq *bar* *foo*)
> 
> By your logic, the sizes of *foo* and *bar* are each zero.

Good point!

Ok, let me refine my argument a little.  Since we were discussing the
size of the structure that *foo* names (i.e., that the symbol *foo* is
pointing to, rather than the symbol itself), then setting *foo* to NIL
isn't enough, we really need to break *every* link to the structure
that *foo* points to.

And now I'll say that the symbol `*foo*' may be a pretty small thing,
but what it contains (or rather, did contain before we gc'd it), is
whatever.

From: Barry Margolin
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 19:47:58 +0000
Message-ID: <Ov5Fa.13$h8.210@paloalto-snr1.gtei.net>

In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>Barry Margolin <··············@level3.com> writes:
>
>> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>> >It seems to me that if you are interested in the `aggregate' amount of
>> >storage held by some variable, then you *do* want to know how much of
>> >it would be collected as garbage if the initial link is broken.
>> >
>> >Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
>> >and perform another full GC.  It seems to me that the storage
>> >reclaimed is *exactly* the `size' of *foo*.
>> 
>> (setq *foo* <something>)
>> (setq *bar* *foo*)
>> 
>> By your logic, the sizes of *foo* and *bar* are each zero.
>
>Good point!
>
>Ok, let me refine my argument a little.  Since we were discussing the
>size of the structure that *foo* names (i.e., that the symbol *foo* is
>pointing to, rather than the symbol itself), then setting *foo* to NIL
>isn't enough, we really need to break *every* link to the structure
>that *foo* points to.

But you would have to do this recursively.  Consider this case:

(setq *foo* (cons <expr1> <expr2>))
(setq *bar* (cons (car *foo*) (cdr *foo*)))

Now what's the size of *foo* and *bar*?Setting either of the variables to
nil and GC'ing would only reclaim the cons cell, not the larger data
structures that they refer to.

Since what you would have to set to nil is recursive, it's essentially the
same as just walking the data structures recursively and adding up the
sizes.

-- 
Barry Margolin, ··············@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Joe Marshall
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 20:00:44 +0000
Message-ID: <ptln58cz.fsf@ccs.neu.edu>

Barry Margolin <··············@level3.com> writes:

> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
> >
> >Ok, let me refine my argument a little.  Since we were discussing the
> >size of the structure that *foo* names (i.e., that the symbol *foo* is
> >pointing to, rather than the symbol itself), then setting *foo* to NIL
> >isn't enough, we really need to break *every* link to the structure
> >that *foo* points to.
> 
> But you would have to do this recursively.  Consider this case:
> 
> (setq *foo* (cons <expr1> <expr2>))
> (setq *bar* (cons (car *foo*) (cdr *foo*)))
> 
> Now what's the size of *foo* and *bar*?  Setting either of the variables to
> nil and GC'ing would only reclaim the cons cell, not the larger data
> structures that they refer to.

I think I'd be happy saying that both *foo* and *bar* are size 2 (or
whatever the size of a cons cell is) in this case.  It is true that
they are interrelated, so deleting both could result in a savings of
storage much larger than either.

Of course this makes a very non-intuitive effect.  The size of *FOO*
can change if you SETQ *BAR*.

From: Barry Margolin
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 20:09:06 +0000
Message-ID: <BP5Fa.15$h8.195@paloalto-snr1.gtei.net>

In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>Barry Margolin <··············@level3.com> writes:
>
>> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>> >
>> >Ok, let me refine my argument a little.  Since we were discussing the
>> >size of the structure that *foo* names (i.e., that the symbol *foo* is
>> >pointing to, rather than the symbol itself), then setting *foo* to NIL
>> >isn't enough, we really need to break *every* link to the structure
>> >that *foo* points to.
>> 
>> But you would have to do this recursively.  Consider this case:
>> 
>> (setq *foo* (cons <expr1> <expr2>))
>> (setq *bar* (cons (car *foo*) (cdr *foo*)))
>> 
>> Now what's the size of *foo* and *bar*?  Setting either of the variables to
>> nil and GC'ing would only reclaim the cons cell, not the larger data
>> structures that they refer to.
>
>I think I'd be happy saying that both *foo* and *bar* are size 2 (or
>whatever the size of a cons cell is) in this case.  It is true that
>they are interrelated, so deleting both could result in a savings of
>storage much larger than either.
>
>Of course this makes a very non-intuitive effect.  The size of *FOO*
>can change if you SETQ *BAR*.

It also means that (size-of *foo*) may be less than (size-of (car *foo*)),
another non-intuitive concept.

-- 
Barry Margolin, ··············@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Joe Marshall
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 20:56:43 +0000
Message-ID: <k7bv55ro.fsf@ccs.neu.edu>

Barry Margolin <··············@level3.com> writes:

> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
> >Barry Margolin <··············@level3.com> writes:
> >
> >> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
> >> >
> >> >Ok, let me refine my argument a little.  Since we were discussing the
> >> >size of the structure that *foo* names (i.e., that the symbol *foo* is
> >> >pointing to, rather than the symbol itself), then setting *foo* to NIL
> >> >isn't enough, we really need to break *every* link to the structure
> >> >that *foo* points to.
> >> 
> >> But you would have to do this recursively.  Consider this case:
> >> 
> >> (setq *foo* (cons <expr1> <expr2>))
> >> (setq *bar* (cons (car *foo*) (cdr *foo*)))
> >> 
> >> Now what's the size of *foo* and *bar*?  Setting either of the variables to
> >> nil and GC'ing would only reclaim the cons cell, not the larger data
> >> structures that they refer to.
> >
> >I think I'd be happy saying that both *foo* and *bar* are size 2 (or
> >whatever the size of a cons cell is) in this case.  It is true that
> >they are interrelated, so deleting both could result in a savings of
> >storage much larger than either.
> >
> >Of course this makes a very non-intuitive effect.  The size of *FOO*
> >can change if you SETQ *BAR*.
> 
> It also means that (size-of *foo*) may be less than (size-of (car *foo*)),
> another non-intuitive concept.

That one I'm comfortable with.

From: Barry Margolin
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 21:51:39 +0000
Message-ID: <Lj7Fa.19$h8.33@paloalto-snr1.gtei.net>

In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
>Barry Margolin <··············@level3.com> writes:
>> It also means that (size-of *foo*) may be less than (size-of (car *foo*)),
>> another non-intuitive concept.
>
>That one I'm comfortable with. 

Not me.  I think that if the car and cdr don't refer to any common objects,
that (size-of *foo*) should be (+ (size-of (car *foo*)) (size-of (cdr
*foo*))).

-- 
Barry Margolin, ··············@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Kent M Pitman
Subject: Re: Data structure size can o' worms
Date: Mon, 09 Jun 2003 22:25:39 +0000
Message-ID: <sfwadcq3n30.fsf@shell01.TheWorld.com>

Barry Margolin <··············@level3.com> writes:

> In article <············@ccs.neu.edu>, Joe Marshall  <···@ccs.neu.edu> wrote:
> >Barry Margolin <··············@level3.com> writes:
> >> It also means that (size-of *foo*) may be less than (size-of (car *foo*)),
> >> another non-intuitive concept.
> >
> >That one I'm comfortable with. 
> 
> Not me.  I think that if the car and cdr don't refer to any common objects,
> that (size-of *foo*) should be (+ (size-of (car *foo*)) (size-of (cdr
> *foo*))).

What about compact string representations?  What if there's an immediate
datatype for small byte-coded string.  perhaps, for example, strings of
1-3 letters with no special attributes could be encoded in an immediate 
pointer.  what if, then, the size of each character objects that appeared
to be the string's contents was the same as the size of the object?
what if under cdr-coding, the size of NIL is a full pointer and the size
of 7 is a full pointer, but the size of (list x) is not 2, but 1, since
cdr-codes mean that it can't be consed.

I don't think that anyone should be doing size-of in the first place, but
if they are, then certainly you have to expect weird math if any kind of
smart representations are used.

From: Joe Marshall
Subject: Re: Data structure size can o' worms
Date: Tue, 10 Jun 2003 16:17:26 +0000
Message-ID: <el216h61.fsf@ccs.neu.edu>

Barry Margolin <··············@level3.com> writes:

> Not me.  I think that if the car and cdr don't refer to any common objects,
> that (size-of *foo*) should be (+ (size-of (car *foo*)) (size-of (cdr
> *foo*))).
 
+ (size-of cons-cell)  ?

From: Rob Warnock
Subject: Re: Data structure size can o' worms
Date: Sat, 14 Jun 2003 11:45:58 +0000
Message-ID: <EICcnf6L_4Vrl3ajXTWc-w@speakeasy.net>

Joe Marshall  <···@ccs.neu.edu> wrote:
+---------------
| Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
| and perform another full GC.  It seems to me that the storage
| reclaimed is *exactly* the `size' of *foo*.
+---------------

Careful!! Don't forget the builtin "history" variables in the REPL.
You probably want to do something like this:

	> (gc :full t)
	> (room)		; size while *foo* still holds a reference
	> (setq *foo* nil)	; zap *foo* and * and /
	> (setq *foo* nil)	; zap ** and //
	> (setq *foo* nil)	; zap *** and ///
	> (gc :full t)
	> (room)		; size after *foo* is gone

[I know you know all of this, but CL newbies (such as myself!)
sometimes overlook it when trying to do GC measurements...]


-Rob

-----
Rob Warnock, PP-ASEL-IA		<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Rob Warnock
Subject: Re: Data structure size can o' worms
Date: Sat, 14 Jun 2003 12:09:07 +0000
Message-ID: <ckudnY9b1J3-jXajXTWc-g@speakeasy.net>

Yikes! I just wrote:
+---------------
| Joe Marshall  <···@ccs.neu.edu> wrote:
| +---------------
| | Suppose that the system has done a full GC.  Now you SETQ *foo* to NIL
| | and perform another full GC.  It seems to me that the storage
| | reclaimed is *exactly* the `size' of *foo*.
| +---------------
| 
| Careful!! Don't forget the builtin "history" variables in the REPL.
| You probably want to do something like this:
| 
| 	> (gc :full t)
| 	> (room)		; size while *foo* still holds a reference
| 	> (setq *foo* nil)	; zap *foo* and * and /
| 	> (setq *foo* nil)	; zap ** and //
| 	> (setq *foo* nil)	; zap *** and ///
| 	> (gc :full t)
| 	> (room)		; size after *foo* is gone
| 
| [I know you know all of this, but CL newbies (such as myself!)
| sometimes overlook it when trying to do GC measurements...]
+---------------

And of course, it just *had* to have a bug, didn't it?!?  ;-}  ;-}
The REPL variables may have had some big junk in them *before* the
first GC, and therefore the first ROOM may print an overly-large
value. I think you may need something like this instead:

  	> (setq *** nil /// nil
		** nil // nil)	; * and / will be taken care of by this, too.
  	> (gc :full t)		; values of GC & ROOM not defined, but
				; probably safe [NIL, T, (values), etc.]).
  	> (room)		; size while *foo* still holds a reference
  	> (setq *foo* nil)	; zap *foo* and * and /
  	> (gc :full t)
  	> (room)		; size after *foo* is gone

If you're manually typing this at the REPL, you can't do much about
+++, ++, +, and -, but at least they're known small constants, given
the above forms. [Well, actually you *can*, if you put of all the above
(and a little more) in a function that clears them out, but...]


-Rob

-----
Rob Warnock, PP-ASEL-IA		<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607