data bloat in CL

From: Thomas M. Breuel
Subject: data bloat in CL
Date: Sat, 20 Aug 1994 23:23:40 +0000
Message-ID: <TMB.94Aug21012340@arolla.idiap.ch>

In article <··········@info-server.bbn.com> ·····@labs-n.bbn.com writes:
|In article <·················@arolla.idiap.ch> ···@arolla.idiap.ch (Thomas M. Breuel) writes:
|--> (think about how much space your
|--> typical "struct { int x; double y; char z;};" takes as a CommonLisp
|--> DEFSTRUCT)
|
|my guess is that it would be 128 bits. 4 words, expecting word-alignment
|behavior (well, that's machine-dependent stuff, I was thinking of a
|32-bit machine).
|
|sounds like you're suggesting that it comes out otherwise...

The space required for that kind of object in CommonLisp is usually at
least 6 (32bit-)words: you pay an extra word for the pointer to the
structure and at least an extra word for its header.  In many
implementations, you will also pay another word for a pointer to the
double, and the double pointed to may even have an additional header.

Here are some other cases to think about and the minimal overhead I
think you would get on a 32bit machines for most if not all CommonLisp
implementations (even with full declarations and some optimistic
assumptions) vs. most C implementations:

(1)	struct { int x,y; } foo[N];		    - +100% (4N vs. 2N words)

(2)	struct { char x,y,z,q; } foo[N];	    - +600% (6N vs. 1N words)

(3)	struct { float vect1[3],vect2[3]; } foo[N]; - +130% (14N vs. 6N words)

Several areas of AI (natural language processing, speech recognition,
OCR, computer vision, etc.) require more and more being able to
represent and manipulate data structures like the above highly
efficiently.  Other areas where that kind of stuff becomes important
is in graphics, numerical algorithms, and intelligent database
querying.

Is this fixable in CommonLisp?  Well, right now, you can rewrite your
code as if you were porting it from C to FORTRAN (roughly, convert
arrays of structures to structures of arrays) in order to get
reasonable performance in CommonLisp, but that's a big pain.

CommonLisp implementations probably could do much better than they are
doing right now even using existing language constructs.  For example,
they might be able to pack structures just like C and implement
(MAKE-ARRAY ... :ELEMENT-TYPE 'SOME-STRUCTURE-TYPE) in such a way as
to avoid pointers and individual structure headers, and AREF and
friends, in the presence of religious declarations, might be able to
do the right thing in terms of pulling out pointers into the interior
of such arrays (note that, in general, the pointers would have to
carry the structure type information).  But I have yet to see a
CommonLisp implementation that even tries.

				Thomas.

Re: data bloat in CL Jeff Dalton
Re: data bloat in CL Thomas M. Breuel
- Re: data bloat in CL Ken Anderson
- Re: data bloat in CL Ken Anderson
  - Re: data bloat in CL Thomas M. Breuel
Re: data bloat in CL Kelly Murray
- Re: data bloat in CL Lawrence G. Mayka
Re: data bloat in CL Lawrence G. Mayka

From: Jeff Dalton
Subject: Re: data bloat in CL
Date: Mon, 22 Aug 1994 17:33:10 +0000
Message-ID: <Cuy63A.1CF@cogsci.ed.ac.uk>

In article <·················@arolla.idiap.ch> ···@idiap.ch writes:

>CommonLisp implementations probably could do much better than they are
>doing right now even using existing language constructs.  For example,
>they might be able to pack structures just like C and implement
>(MAKE-ARRAY ... :ELEMENT-TYPE 'SOME-STRUCTURE-TYPE) in such a way as
>to avoid pointers and individual structure headers, and AREF and
>friends, in the presence of religious declarations, might be able to
>do the right thing in terms of pulling out pointers into the interior
>of such arrays (note that, in general, the pointers would have to
>carry the structure type information).  But I have yet to see a
>CommonLisp implementation that even tries.

AREF should be able to do _a_ right thing even without declarations,
because it could look at the array to see what it's element type was.

BTW, thanks for writing so informative -- and constructive -- an
article.

-- jeff

From: Thomas M. Breuel
Subject: Re: data bloat in CL
Date: Wed, 24 Aug 1994 20:21:20 +0000
Message-ID: <TMB.94Aug24222120@arolla.idiap.ch>

In article <······················@wheaton.bbn.com> ········@wheaton.bbn.com (Ken Anderson) writes:
|Perhaps, but alignment restrictions can make such structures bigger than
|you might think. [...]
|It is easy to imagine small hunks of C, like the 3 examples above, that
|seem smaller of faster than what one might do in Lisp. However, you
|shouldn't extrapolate this to large programs.

I think a point-by-point reply to this posting would be tedious (send
me mail if you want it). Suffice it to say that the examples I give
are based on extensive experience with large programs in both C/C++
and Lisp/Lisp-like languages. The problems I describe are real and
are precisely why many software developers have been forced to use C,
even though they grew up on Lisp and would prefer a Lisp-like
language.

There is no question that some numerical code can be made to run
efficiently in CommonLisp, so the kind of proof by example you attempt
isn't all that interesting. But in my experience, a lot of numerical
and non-numerical code requires an extraordinary effort to convert
into efficient CommonLisp, more so that just writing it in C, for
exactly the reasons I state.

I also believe that the problems I point out are fixable, using better
compilation techniques and by improving the language in certain areas
(see my message). If I didn't believe that, it wouldn't be worth
talking about it. The first step towards addressing those problems is
to stop denying they exist.

Thomas.

From: Ken Anderson
Subject: Re: data bloat in CL
Date: Thu, 25 Aug 1994 14:41:43 +0000
Message-ID: <KANDERSO.94Aug25144143@wheaton.bbn.com>

In article <·················@arolla.idiap.ch> ···@arolla.idiap.ch (Thomas M. Breuel) writes:

I don't think anyone is denying there are real problems just as there are
in other languages. I believe you are talking from genuine, extensive
experience. However, it comes across as hearsay. Its too bad few people
are willing to get quantitative.

I admit my example is only a beginning. However, you reject it and only
offer a brief description of your experience in exchange. What were you
trying to do? What kind of "extraordinary effort" did it take? What Lisp
were you using, when? Show us the code. If you compare the Lisp and C
versions of my example, they are pretty identical and so probably was the
effort to develop them.

When i write some Lisp code that i think should be fast, i check it. If it
isn't fast, or it conses when it shouldn't, i look into why. I get my
vendor involved when i need to. So far, it hasn't been an extrordinary
effort, just programming.

OK, problems exist. What is the next step? My choice was to understand
what the performance issues are, by measuring them, and eliminating the
ones i could, and getting my vendor to help. The next step after that
might be to get a better packaged, faster, smaller, Lisp. But i don't have
time to do that myself, or wait for one.

--
Ken Anderson
Internet: ·········@bbn.com
BBN ST Work Phone: 617-873-3160
10 Moulton St. Home Phone: 617-643-0157
Mail Stop 6/4a FAX: 617-873-2794
Cambridge MA 02138
USA

From: Ken Anderson
Subject: Re: data bloat in CL
Date: Thu, 25 Aug 1994 15:36:29 +0000
Message-ID: <KANDERSO.94Aug25153629@wheaton.bbn.com>

In article <·················@arolla.idiap.ch> ···@arolla.idiap.ch (Thomas M. Breuel) writes:


   In article <······················@wheaton.bbn.com> ········@wheaton.bbn.com (Ken Anderson) writes:
   |Perhaps, but alignment restrictions can make such structures bigger than
   |you might think. [...]
   |It is easy to imagine small hunks of C, like the 3 examples above, that
   |seem smaller of faster than what one might do in Lisp.  However, you
   |shouldn't extrapolate this to large programs.

   I think a point-by-point reply to this posting would be tedious (send
   me mail if you want it).  Suffice it to say that the examples I give
   are based on extensive experience with large programs in both C/C++
   and Lisp/Lisp-like languages.  The problems I describe are real and
   are precisely why many software developers have been forced to use C,
   even though they grew up on Lisp and would prefer a Lisp-like
   language.

This got me thinking about my application.  I realize it is probably quite
different from your, but maybe you can help me work up a way of comparing
data structure issues in C and Lisp for realistic applications.  Also,
perhaps this will prompt you and others will provide valuable data about
their applications.

I've included my code below, but here is a summary.  We have a composit
object consisting of about 225,000 objects from 25 classes, and about 1.7
Mega words total in size.

Exhibit one shows the effect of removing various overheads.  The best i
could do, after removing all overhead on everything but builtin objects,
was about 25%.  Doing a detailed convertion to C++, might give you 30%, but
you'd have to add back in some overhead, so maybe 15 to 20% is more
realistic.  Also, it isn't hard to get about a 14% compression by using
structs rather than objects, and better string sharing, as the last two
lines show.  So, the resulting C++ objects might be fairly comparable.

Exhibit 2 contains the code.

k

=== Exhibit 1 ===
#||
RELATIVE
SIZE                       WHAT:
99.99 (%size 0 0 0 0) ; Original Lisp encoding.
93.18 (%size 0 0 4 0) ; Remove immediate floats,
93.16 (%size 0 1 4 0) ; And remove struct overhead,
90.29 (%size 1 1 4 0) ; And remove some object overhead,
75.95 (%size 6 1 4 0) ; And remove all object overhead.
91.39 (%size 3 0 0 0) ; Objects -> structs, only.
95.00  estimate       ; Better string sharing, only.
||#

=== Exhibit 2 ===
(defun object-size (objOV strucOV floatOV builtinOV)
  (flet
      ((obj (type words refs objs r/o w/o)
	 (declare (ignore type refs r/o w/o))
	 (size words objs objOV))
       (struc (type words refs objs r/o w/o)
	 (declare (ignore type refs r/o w/o))
	 (size words objs strucOV))
       (sfloat (type words refs objs r/o w/o)
	 (declare (ignore type refs r/o w/o))
	 (size words objs floatOV))
       (builtin (type words refs objs r/o w/o)
	 (declare (ignore type refs r/o w/o))
	 (size words objs builtinOV)))
    ;; Storage report for #<NUDART:TPFDD 118PW-TRANS-6MAY94-CINS3>
    ;;        Type                      Words Reference Objects   Refs/Obj   Words/object
    (+
     (obj    'DART-LEVEL-3-ULN          354965  19775    8255         2.40        43.00)
     (obj    'DART-LEVEL-3-CIN          328020   9940    9940         1.00        33.00)
     (obj    'TPFDD-CARGO-CATEGORIES    116523  10593   10593         1.00        11.00)
     (obj    'HASH-TABLE                 63519     35      35         1.00      1814.83)
     (obj    'CARGO-PORTION              62536   7817    7817         1.00         8.00)
     (obj    'DOUBLE-FLOAT               58878   9977    9813         1.02         6.00)
     (obj    'TUCHA-CARGO-CATEGORIES     43140  19736    4314         4.57        10.00)
     (obj    'ICAO-GEOLOC                 2888    152     152         1.00        19.00)
     (obj    'GEOLOC                      2808    156     156         1.00        18.00)
     (obj    'TPFDD-GEOLOC                2156  73168     308       237.56         7.00)
     (obj    'GEOLOC-COUNTRY               670    380      67         5.67        10.00)
     (obj    'DART-PIN                     660     40      20         2.00        33.00)
     (obj    'GEOLOC-PROVINCE              504    163      72         2.26         7.00)
     (obj    'FORCE-MODULE                 500    116      25         4.64        20.00)
     (obj    'UNIT-LEVEL                   270   8250      45       183.33         6.00)
     (obj    'PROVIDING-ORGANIZATION       119  18215      17      1071.47         7.00)
     (obj    'TPFDD                         29  18241       1     18241.00        29.00)

     (struc  'LOC                         1540    308     308         1.00         5.00)


     (sfloat  'SINGLE-FLOAT               83452  20863   30676         1.00         4.00)

     (builtin 'SYMBOL                       654 162221     109      1488.27         6.00)
     (builtin 'VECTOR                    213552  78482   39293         2.00         5.43)
     (builtin 'STRING                    203247 103192   59797         1.73         3.40)
     (builtin 'NULL                           6 166455       1    166455.00         6.00)
     (builtin 'CONS                      258606  77947   49697         1.57         5.20)
     )))

(defun size (words objects overhead)
  (* objects (- (/  words objects) overhead)))

(defun %size (objOV strucOV floatOV builtinOV)
  (/ (* 100.0 (object-size objOV strucOV floatOV builtinOV))
     (object-size 0 0 0 0)))

--
Ken Anderson 
Internet: ·········@bbn.com
BBN ST               Work Phone: 617-873-3160
10 Moulton St.       Home Phone: 617-643-0157
Mail Stop 6/4a              FAX: 617-873-2794
Cambridge MA 02138
USA

From: Thomas M. Breuel
Subject: Re: data bloat in CL
Date: Sat, 27 Aug 1994 01:34:46 +0000
Message-ID: <TMB.94Aug27033446@arolla.idiap.ch>

In article <······················@wheaton.bbn.com> ········@wheaton.bbn.com (Ken Anderson) writes:
|I've included my code below, but here is a summary.  We have a composit
|object consisting of about 225,000 objects from 25 classes, and about 1.7
|Mega words total in size.
|nnnnnn
|Exhibit one shows the effect of removing various overheads.  The best i
|could do, after removing all overhead on everything but builtin objects,
|was about 25%.  Doing a detailed convertion to C++, might give you 30%, but
|you'd have to add back in some overhead, so maybe 15 to 20% is more
|realistic.  Also, it isn't hard to get about a 14% compression by using
|structs rather than objects, and better string sharing, as the last two
|lines show.  So, the resulting C++ objects might be fairly comparable.

If I interpret your data correctly, it isn't particularly surprising
that you don't get a lot of savings for your code: the objects that
are taking up the most space are very large (43 and 33 words each,
respectively), and most of your other objects seem pretty big as
well.  Also, your total application is pretty small (1.7 Mwords).
The kinds of applications I'm talking about have millions of small
objects (say, 4 words each).

I have had parallel implementations in Lisp and C in the past (I
prototype in Lisp and similar languages), and the differences have
been large.  I'll try and find some measurements myself.

				Thomas.

From: Kelly Murray
Subject: Re: data bloat in CL
Date: Fri, 26 Aug 1994 16:43:47 +0000
Message-ID: <33l643$la9@sand.cis.ufl.edu>

In article <·················@arolla.idiap.ch>, ···@arolla.idiap.ch (Thomas M. Breuel) writes:
|> In article <··········@info-server.bbn.com> ·····@labs-n.bbn.com writes:
|> |In article <·················@arolla.idiap.ch> ···@arolla.idiap.ch (Thomas M. Breuel) writes:
|> |--> (think about how much space your
|> |--> typical "struct { int x; double y; char z;};" takes as a CommonLisp
|> |--> DEFSTRUCT)
|> |
|> |my guess is that it would be 128 bits. 4 words, expecting word-alignment
|> |behavior (well, that's machine-dependent stuff, I was thinking of a
|> |32-bit machine).
|> [...]
|> (1)	struct { int x,y; } foo[N];		    - +100% (4N vs. 2N words)
|> 
|> (2)	struct { char x,y,z,q; } foo[N];	    - +600% (6N vs. 1N words)
|> 
|> (3)	struct { float vect1[3],vect2[3]; } foo[N]; - +130% (14N vs. 6N words)
|> [...]
|> CommonLisp implementations probably could do much better than they are
|> doing right now even using existing language constructs.  For example,
|> they might be able to pack structures just like C and implement
|> (MAKE-ARRAY ... :ELEMENT-TYPE 'SOME-STRUCTURE-TYPE) in such a way as
|> to avoid pointers and individual structure headers, and AREF and
|> friends, in the presence of religious declarations, might be able to
|> do the right thing in terms of pulling out pointers into the interior
|> of such arrays (note that, in general, the pointers would have to
|> carry the structure type information).  But I have yet to see a
|> CommonLisp implementation that even tries.
|> 

My CommonLisp implementation did this!  (I wish I could show it to you)

I provided a macro Defstruct-Foreign which was used to define
packed structures primarily for passing to/from C,
but could just as well be used entirely within Lisp.

MAKE- and SETF'able slot-accessors were defined which returned
Lisp objects from the packed-struct, and also converted Lisp object
into the packed struct when set.

Can't recall exactly how I dealt with arrays, but I think I defined
an accessor/settor function that took an index, 
so  (defstruct-foriegn foo (a char 100))
would generate a (foo-a obj index) function which returned a char,
and was also SETF'able.

-- 
- Kelly Murray  (···@prl.ufl.edu) <a href="http://www.prl.ufl.edu">
-University of Florida Parallel Research Lab </a> 96-node KSR1, 64-node nCUBE

From: Lawrence G. Mayka
Subject: Re: data bloat in CL
Date: Sun, 28 Aug 1994 16:42:40 +0000
Message-ID: <LGM.94Aug28114240@polaris.ih.att.com>

In article <··········@sand.cis.ufl.edu> ···@prl.ufl.edu (Kelly Murray) writes:

   I provided a macro Defstruct-Foreign which was used to define
   packed structures primarily for passing to/from C,
   but could just as well be used entirely within Lisp.

   MAKE- and SETF'able slot-accessors were defined which returned
   Lisp objects from the packed-struct, and also converted Lisp object
   into the packed struct when set.

   Can't recall exactly how I dealt with arrays, but I think I defined
   an accessor/settor function that took an index, 
   so  (defstruct-foriegn foo (a char 100))
   would generate a (foo-a obj index) function which returned a char,
   and was also SETF'able.

Yes, CL implementations often provide "foreign-type" structures that
may be useful for economizing on space in exactly this way.
--
        Lawrence G. Mayka
        AT&T Bell Laboratories
        ···@ieain.att.com

Standard disclaimer.

From: Lawrence G. Mayka
Subject: Re: data bloat in CL
Date: Sun, 28 Aug 1994 16:37:04 +0000
Message-ID: <LGM.94Aug28113704@polaris.ih.att.com>

In article <······················@wheaton.bbn.com> ········@wheaton.bbn.com (Ken Anderson) writes:

   Yes, this would be nice.  I also wish we could stack allocate any kind of
   object, like you can in C++.

Most (all?) explicit object allocations can take a DYNAMIC-EXTENT
declaration.  I assume your wish refers to the fact that CL
implementations typically carry out stack allocation for only a few,
if any, datatypes.
--
        Lawrence G. Mayka
        AT&T Bell Laboratories
        ···@ieain.att.com

Standard disclaimer.