From: rif
Subject: lisp performance questions and observations
Date: 
Message-ID: <wj01xfotv7j.fsf@five-percent-nation.mit.edu>
I've been writing a suffix tree program in CL.  It's pretty fast, but
the C reference program I'm comparing against (I have a binary only,
not the source) is still a bit faster, and uses much less memory,
which is important for suffix tree algorithms, because you want to run
on strings as large as possible.

I'm using CMUCL.

Things I am observing:

1.  Using structs under CMUCL, there appears to be an overhead of 8
    bytes/object, and everything is done in eight byte chunks ---
    either a one field struct or a two field struct costs 16 bytes.
    Given that the structs I need have an odd number of fields, I'm
    paying an extra 12 bytes for every node in my tree as compared to
    what I'd pay in C.

2.  The lack of "homogeneous" arrays of non-primitive types is
    painful.  It'd be really useful to be able to say "Here's a struct
    declaration.  Give me an array of these structs, where they're
    laid out 'in place'.  No extra indirection."

Has anyone else encountered these difficulties when trying to optimize
algorithms?  Is there any way around them, in CMUCL or other CL
implementations?  I've gotten to the point where I know how to write
numerical matrix-based CL programs that are essentially as fast and
memory efficient as I'd write in C/C++, but it seems that for
"complicated data structures", like tree based algorithms, where I'd
really expect CL to excel, I'm paying a pretty big memory overhead.
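For concreteness, here's a minimal sketch of the kind of node I mean
(the slot names are invented for this example):

```lisp
;; Three fixnum slots -- an odd number of fields, as in my actual
;; tree nodes.  Slot names are hypothetical.
(defstruct node
  (first-child 0 :type fixnum)
  (sibling     0 :type fixnum)
  (label       0 :type fixnum))

;; In C the equivalent struct is 3 x 4 = 12 bytes.  Under the CMUCL
;; layout described above (a header word plus rounding up to 8-byte
;; chunks), each instance costs 8 + 12 = 20, rounded to 24 bytes:
;; an extra 12 bytes per node.
```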

Cheers,

rif

From: Neo-LISPer
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87y8hwh70x.fsf@yahoo.com>
Have you thought about using some macro magic to transform those
arrays of structs into structs of arrays? I've heard there were some
performance-oriented C compilers that did something like this in a
manner transparent to the programmer. It might or might not improve
performance, depending on what you are doing, but I think it will save
you some memory.
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0lldva8rr.fsf@five-percent-nation.mit.edu>
Neo-LISPer <··········@yahoo.com> writes:

> Have you thought about using some macro magic to transform those
> arrays of structs into structs of arrays? I've heard there were some
> performance-oriented C compilers that did something like this in a
> manner transparent to the programmer. It might or might not improve
> performance, depending on what you are doing, but I think it will save
> you some memory.

I haven't tried this yet, but it's on my to-do list.  I'd be worried
about locality of reference issues.  However, since all the fields in
my objects are fixnums, I can actually implement the whole thing under
the hood as a single large array of fixnums.
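A sketch of what I have in mind (the names are invented, and I haven't
tried this yet):

```lisp
;; All fields are fixnums, so a node becomes K consecutive slots in
;; one specialized vector.  +FIELDS-PER-NODE+ and NODE-FIELD are
;; hypothetical names.
(defconstant +fields-per-node+ 3)

(defmacro node-field (nodes index field)
  `(aref ,nodes (+ (* ,index +fields-per-node+) ,field)))

(defun make-nodes (n)
  (make-array (* n +fields-per-node+)
              :element-type 'fixnum :initial-element 0))

;; SETF works through the macro, since it expands into AREF:
;;   (setf (node-field nodes 5 0) 42)
```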

rif
From: Duane Rettig
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <4ekjnybw5.fsf@franz.com>
rif <···@mit.edu> writes:

> I've been writing a suffix tree program in CL.  It's pretty fast, but
> the C reference program I'm comparing against (I have a binary only,
> not the source) is still a bit faster, and uses much less memory,
> which is important for suffix tree algorithms, because you want to run
> on strings as large as possible.
> 
> I'm using CMUCL.
> 
> Things I am observing:
> 
> 1.  Using structs under CMUCL, there appears to be an overhead of 8
>     bytes/object, and everything is done in eight byte chunks ---
>     either a one field struct or a two field struct costs 16 bytes.
>     Given that the structs I need have an odd number of fields, I'm
>     paying an extra 12 bytes for every node in my tree as compared to
>     what I'd pay in C.
> 
> 2.  The lack of "homogeneous" arrays of non-primitive types is
>     painful.  It'd be really useful to be able to say "Here's a struct
>     declaration.  Give me an array of these structs, where they're
>     laid out 'in place'.  No extra indirection."
> 
> Has anyone else encountered these difficulties when trying to optimize
> algorithms?  Is there any way around them, in CMUCL or other CL
> implementations?  I've gotten to the point where I know how to write
> numerical matrix-based CL programs that are essentially as fast and
> memory efficient as I'd write in C/C++, but it seems that for
> "complicated data structures", like tree based algorithms, where I'd
> really expect CL to excel, I'm paying a pretty big memory overhead.

Since it seems that you're porting a C solution to Lisp, you might
consider using "C types" as well.  I'm talking about type extensions
beyond CL which emulate the C type and struct system.  Allegro CL's
version is called a "foreign type" - I don't know if CMUCL has an
equivalent.  For a description of Allegro CL's version see
http://www.franz.com/support/documentation/6.2/doc/ftype.htm

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0pt37a8t6.fsf@five-percent-nation.mit.edu>
> Since it seems that you're porting a C solution to Lisp, you might
> consider using "C types" as well.  I'm talking about type extensions
> beyond CL which emulate the C type and struct system.  Allegro CL's
> version is called a "foreign type" - I don't know if CMUCL has an
> equivalent.  For a description of Allegro CL's version see
> http://www.franz.com/support/documentation/6.2/doc/ftype.htm

What do you mean specifically by "porting a C solution to Lisp?"
Certainly, I have no C code.  I started with an algorithm that I'd
read in a paper.  I understand that I can get more speed by "going
down to C", this is certainly a possibility.  I'm just interested in
exploring the solution space right now, trying to see how good a
program I can write in pure (or portable) CL.  I've often heard that
very fast programs can be written in pure CL (with a good compiler).
For numerical work, I've found this to be true --- if I declare all my
arrays to be arrays of double-floats (and this can be done very
cleanly with macros), I get a program that uses very close to the time
and space of the equivalent C program, while being much more pleasant
to write.  Now I am trying to see if, for "complicated data
structures" programs, CL is at more of a disadvantage to C.  So far,
it seems to me that it is --- it is tough to write programs that use
an equivalent amount of memory to C without drilling down to the
foreign object level.
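By "declare all my arrays to be arrays of double-floats" I mean code in
roughly this shape (the function itself is just an invented example):

```lisp
;; With specialized arrays and type declarations, a good CL compiler
;; can keep the double-floats unboxed through the whole loop.
(defun dot (x y)
  (declare (type (simple-array double-float (*)) x y)
           (optimize (speed 3) (safety 1)))
  (let ((acc 0d0))
    (declare (type double-float acc))
    (dotimes (i (length x) acc)
      (incf acc (* (aref x i) (aref y i))))))
```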

For anyone who knows both systems, are CMUCL's aliens roughly the
equivalent of ACL's foreign types?

rif
From: Duane Rettig
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <4brergw1o.fsf@franz.com>
rif <···@mit.edu> writes:

> > Since it seems that you're porting a C solution to Lisp, you might
> > consider using "C types" as well.  I'm talking about type extensions
> > beyond CL which emulate the C type and struct system.  Allegro CL's
> > version is called a "foreign type" - I don't know if CMUCL has an
> > equivalent.  For a description of Allegro CL's version see
> > http://www.franz.com/support/documentation/6.2/doc/ftype.htm
> 
> What do you mean specifically by "porting a C solution to Lisp?"
> Certainly, I have no C code.  I started with an algorithm that I'd
> read in a paper.

Perhaps the algorithm itself is C-centric.

>  I understand that I can get more speed by "going
> down to C", this is certainly a possibility.

What leads you to this conclusion?  If you are "going down to
C", where are you coming from, and if you want the speed, why
don't you "go down to asm"?

>  I'm just interested in
> exploring the solution space right now, trying to see how good a
> program I can write in pure (or portable) CL.  I've often heard that
> very fast programs can be written in pure CL (with a good compiler).

Yes, and with a good algorithm and data representation.

> For numerical work, I've found this to be true --- if I declare all my
> arrays to be arrays of double-floats (and this can be done very
> cleanly with macros), I get a program that uses very close to the time
> and space of the equivalent C program, while being much more pleasant
> to write.

Agreed.

>  Now I am trying to see if, for "complicated data
> structures" programs, CL is at more of a disadvantage to C.  So far,
> it seems to me that it is --- it is tough to write programs that use
> an equivalent amount of memory to C without drilling down to the
> foreign object level.

Of course it is.  Now tell me why you don't recognize this as a
C-centric solution.  Guess if you have to.  Another way to put the
question is: take a guess as to why I consider it a C-centric
solution.

[I have more to say, but we need to get past this fundamental
issue first] 

> For anyone who knows both systems, are CMUCL's aliens roughly the
> equivalent of ACL's foreign types?

I'll leave this part to someone who knows the answer to it.


-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0wtxe3hdl.fsf@five-percent-nation.mit.edu>
Duane,

Thanks for the comments so far.

> > What do you mean specifically by "porting a C solution to Lisp?"
> > Certainly, I have no C code.  I started with an algorithm that I'd
> > read in a paper.
> 
> Perhaps the algorithm itself is C-centric.

Could you please define what you mean by "an algorithm being
C-centric", or at least give specific examples that would allow me to
understand what you mean?  The algorithms are described in text or
mathematics.  What makes an algorithm C-centric?

> >  I understand that I can get more speed by "going
> > down to C", this is certainly a possibility.
> 
> What leads you to this conclusion?  If you are "going down to
> C", where are you coming from, and if you want the speed, why
> don't you "go down to asm"?

All I meant here was that I can directly see how, in C, to avoid
certain indirections and to use substantially less memory than what
I'm using in CL right now.  In contrast, I cannot see how I could code a
substantially lower-memory solution in asm than what I know how to
code in C, which is why I'd stop at C rather than going to asm.

By "going down to C", I meant only that C feels "closer to the
machine" than CL to me (where "the machine" refers to a circa-2004
linux/x86 workstation, to avoid talk of Lisp Machines).  This
closeness seems to make it easier to implement certain algorithms
using less memory than CL does.  Perhaps this is what you mean by
"C-centric"?

> >  I'm just interested in
> > exploring the solution space right now, trying to see how good a
> > program I can write in pure (or portable) CL.  I've often heard that
> > very fast programs can be written in pure CL (with a good compiler).
> 
> Yes, and with a good algorithm and data representation.

What is the purpose of this comment?  Are you implying that I have a
bad algorithm or a bad representation?  Certainly, I agree that good
algorithms and good representations are important.

> >  Now I am trying to see if, for "complicated data
> > structures" programs, CL is at more of a disadvantage to C.  So far,
> > it seems to me that it is --- it is tough to write programs that use
> > an equivalent amount of memory to C without drilling down to the
> > foreign object level.
> 
> Of course it is.  Now tell me why you don't recognize this as a
> C-centric solution.  Guess if you have to.  Another way to put the
> question is: take a guess as to why I consider it a C-centric
> solution.
> 
> [I have more to say, but we need to get past this fundamental
> issue first] 

So you'll agree that CL is at more of a disadvantage to C for
"complicated data structures" programs?  Perhaps we can get past the
fundamental issue more quickly by you explaining why you consider this
C-centric, as I really have no idea what you're driving at.  I'm very
curious about what you have to say, as your comments here are usually
very insightful, but right now I'm not following you.  I just don't
know what you mean by "C-centric".

Cheers,

rif

ps.  I've been posting in this forum for a couple years now.  You can
refer to me as "rif" rather than "the OP" if you want.
From: Duane Rettig
Subject: Re: nonhomogenous structs (was: lisp performance questions and observations)
Date: 
Message-ID: <4acuahe81.fsf_-_@franz.com>
rif <···@mit.edu> writes:

> Duane,
> 
> Thanks for the comments so far.
> 
> > > What do you mean specifically by "porting a C solution to Lisp?"
> > > Certainly, I have no C code.  I started with an algorithm that I'd
> > > read in a paper.
> > 
> > Perhaps the algorithm itself is C-centric.
> 
> Could you please define what you mean by "an algorithm being
> C-centric", or at least give specific examples that would allow me to
> understand what you mean?  The algorithms are described in text or
> mathematics.  What makes an algorithm C-centric?

I was hoping to be able to lead you to an "aha" experience.  This
series of seemingly opaque questions was designed for that, but it
is clear to me that they're unfairly opaque, because you would
have to be extremely lucky to stumble onto the solution.  I still
want you to think a little about it, but let me give you some
background and a sample as a hint, below.

> > >  I understand that I can get more speed by "going
> > > down to C", this is certainly a possibility.
> > 
> > What leads you to this conclusion?  If you are "going down to
> > C", where are you coming from, and if you want the speed, why
> > don't you "go down to asm"?
> 
> All I meant here was that I can directly see how, in C, to avoid
> certain indirections and to use substantially less memory than what
> I'm using in CL right now.  In contrast, I cannot see how I could code a
> substantially lower-memory solution in asm than what I know how to
> code in C, which is why I'd stop at C rather than going to asm.

Right; and this kind of programming is possible (though not portably)
using at least one Common Lisp implementation's foreign type system.

> By "going down to C", I meant only that C feels "closer to the
> machine" than CL to me (where "the machine" refers to a circa-2004
> linux/x86 workstation, to avoid talk of Lisp Machines).  This
> closeness seems to make it easier to implement certain algorithms
> using less memory than CL does.  Perhaps this is what you mean by
> "C-centric"?

If space is your only concern, then perhaps yes.  But C pays a big
performance price for space-consciousness, and that's why I don't
view going down to the C level as being as efficient as going down to the
asm (or, better, the machine) level.  Read on...

> > >  I'm just interested in
> > > exploring the solution space right now, trying to see how good a
> > > program I can write in pure (or portable) CL.  I've often heard that
> > > very fast programs can be written in pure CL (with a good compiler).
> > 
> > Yes, and with a good algorithm and data representation.
> 
> What is the purpose of this comment?  Are you implying that I have a
> bad algorithm or a bad representation?  Certainly, I agree that good
> algorithms and good representations are important.

Since you haven't explicitly stated the assumptions and limits of
the requirements of your application, nor have you shown any examples
of your solutions, it becomes a little harder to know whether your
representations are good or not.  I assume that you'll be able to
make that assessment yourself, if you can come to understand that the
assumptions you have in fact demonstrated here can be broken down
further and analyzed.  Perhaps you'll draw the same conclusion, but
it will at least be based on deeper knowledge of what the dynamics
between Lisp, C, and the machine are.  Read on...

> > >  Now I am trying to see if, for "complicated data
> > > structures" programs, CL is at more of a disadvantage to C.  So far,
> > > it seems to me that it is --- it is tough to write programs that use
> > > an equivalent amount of memory to C without drilling down to the
> > > foreign object level.
> > 
> > Of course it is.  Now tell me why you don't recognize this as a
> > C-centric solution.  Guess if you have to.  Another way to put the
> > question is: take a guess as to why I consider it a C-centric
> > solution.
> > 
> > [I have more to say, but we need to get past this fundamental
> > issue first] 
> 
> So you'll agree that CL is at more of a disadvantage to C for
> "complicated data structures" programs?

Yes, absolutely.  But I am still curious as to why you assume that
this is a negative for CL.  Read on...

> Perhaps we can get past the
> fundamental issue more quickly by you explaining why you consider this
> C-centric, as I really have no idea what you're driving at.  I'm very
> curious about what you have to say, as your comments here are usually
> very insightful, but right now I'm not following you.  I just don't
> know what you mean by "C-centric".

OK, here are a couple of clues.

First, note that I am only using the term "C-centric" within the
context of this discussion; if we bring other languages into the
mix then the terminology might have to be changed.  The players in
this discussion are:

 - CL
 - C
 - The hardware.

Now, with respect to the homogeneity of data structures, which is
closer to hardware: C or CL?  The answer to this might be surprising
in several ways.

Let's take a simple example; bit fields.  C provides for a
semantically simple way to access a random field of bits as
a unit within a data structure.  And many of the modern
architectures provide for bit manipulation instructions to
accommodate this concept.  But how efficient is such a field of
bits, really?  Let's break it down into the two components of
optimization: speed and space:

 - speed: The speed of bit fields depends on both access and
manipulation (e.g. shifting and masking).  Over the last 40
years, the popularity of bit-manipulation instructions has
been inconsistent.  Some architectures provide extract and
deposit instructions, and others do not, requiring explicit
shifting and masking (for extraction) or shifting, masking,
and OR-ing in (for deposits).  Note, however, that when the bit
field happens to match the width of an access instruction
(say load-byte/char, load-short, load-integer, load-long)
no extra calculation is necessary, and the bit-manipulation
instructions needed to perform that extraction or deposit are
just that: extra.  It is even worse in the storing case; to
store a byte, ..., long, one just uses that instruction, but to
store a bit field of different width or origin, one must load a
larger field, mask and insert the new data, and store the larger
field again.

So which is faster, a three-bit field ranging from 0 - 7,
or an aligned 32-bit field ranging from 0 - 7?  Ironically,
this makes C, in one sense, a higher-level language than CL.

 - size:  It is of course obvious that when structures are
completely packed, they can become much more space efficient
with bit fields than with whole words.  A 32-bit word holding
a three bit field is wasting 29 bits, but _only_ if those 29
bits can be used for something else.  But consider a data
structure that has an int followed by a 3-bit field, followed by
a double (a 64-bit float).  Now, how many of the 29 extra bits
can be used?  Answer: none - they're all wasted.  If you don't
have other bits that can be inserted, you might just as well
make the field an int, and get back the speed you would have lost
on the extra bit manipulation.

So which is smaller in a struct, a three-bit field ranging
from 0 - 7, or a 32-bit field ranging from 0 - 7? (answer:
if taken by itself or with other larger fields, neither is
smaller).

So you might read and understand this, and say "But I don't care about
bit fields; I care about the mixture of bytes, ints, longs, floats,
and double-floats".  And if this is the case (as I suspect it is), I
will leave it to you to guess how I might respond to that...
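Incidentally, CL makes the cost of such fields explicit rather than
implicit: LDB and DPB name exactly the extract and deposit operations
described above.  A small illustration:

```lisp
;; LDB extracts a bit field; DPB deposits one.  Both compile down to
;; the shift/mask (or extract/deposit) sequences discussed above.
(let ((word #b1010110))
  (list (ldb (byte 3 2) word)       ; extract 3 bits at offset 2 => 5
        (dpb 7 (byte 3 2) word)))   ; deposit 7 into that field => 94
```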

> ps.  I've been posting in this forum for a couple years now.  You can
> refer to me as "rif" rather than "the OP" if you want.

If you're asking me to be less impersonal, then that is hard to do
with just a handle, even if it is part of your name, so I don't see
"rif" as any more personal than "the OP".  If you truly prefer to be
called "rif" rather than Ryan or Michael, then I suppose I could do that,
though that doesn't tend to be my style.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: rif
Subject: Re: nonhomogenous structs (was: lisp performance questions and observations)
Date: 
Message-ID: <wj0acu91p75.fsf@five-percent-nation.mit.edu>
Duane Rettig <·····@franz.com> writes:

[many interesting comments deleted] 

> So you might read and understand this, and say "But I don't care about
> bit fields; I care about the mixture of bytes, ints, longs, floats,
> and double-floats".  And if this is the case (as I suspect it is), I
> will leave it to you to guess how I might respond to that...

Upon the first couple readings of your post, I thought that the thrust
of your argument is that there is a speed tradeoff in using less space
--- as I start to use fields that need to be bitmasked and shifted to
be extracted, I will pay a speed penalty as compared to using
machine-word-sized pieces.  (I am certainly aware of this issue, as it
has come up for me before in the form of "double-float floating-point
operations are often much faster than single-float, in addition to
being more precise.")

However, I am clearly still missing something, as I don't think I
understand the difference between C and asm in this context.  Perhaps
you meant that in C, one might end up (naively?) using bit-field
extraction operations on fields that were already properly aligned,
which is therefore wasteful, and in assembly you'd be likely to notice
this?  It seems unlikely to me that that's what you meant, but I'm not
sure what you did mean.

In my application, all the fields are actually integers.  (To be fair,
I had not said this before.)  It's fine to represent them as 32-bit
ints (or as CMUCL fixnums, actually).  I'm not trying to squeeze every
last bit out of each field, as I wouldn't want to pay the speed price
for that in C.  The big problem with my CL version is not about
bit-packing, it's about the fact that if what I want seems to be
naturally represented as an array of structs, I'm going to pay (under
CMUCL) an extra 3 machine words/object: one word because instead of
laying the objects directly into the array, I have an array of
pointers, and two more words because each struct has to have the
header that says what type it is, even though I know in advance that
they're all the same type.

(Note that in this application, because everything is actually an
integer, I can represent the array of structs as a single big array of
integers.  I can develop syntax that hides this abstraction from the
algorithm, and this is what I'm likely to do.  If I had a mix of
"individually well-aligned" types, things would be tougher --- I'd
have to use parallel arrays to do it cleanly in CL, and my locality of
reference would suck, even in the case where everything lays out
nicely in C.)

So actually, I don't care about the mixture between bytes, longs,
ints, and floats.  I care about the fact that I'm paying 3 extra
machine words for each struct before I even get to the "data" for the
struct itself.

I still don't know what "C-centric" means in the context of this
conversation.  I suspect I misunderstood your point; I apologize in
advance.

rif

> > ps.  I've been posting in this forum for a couple years now.  You can
> > refer to me as "rif" rather than "the OP" if you want.
> 
> If you're asking me to be less impersonal, then that is hard to do
> with just a handle, even if it is part of your name, so I don't see
> "rif" as any more personal than "the OP".  If you truly prefer to be
> called "rif" rather than Ryan or Michael, then I suppose I could do that,
> though that doesn't tend to be my style.

Well, I'm not asking you to do anything (although I'll appreciate it
if you continue enlightening me), I'm simply inviting you.  Ryan
Rifkin is my name.  Rif is the nickname I generally go by, and it's
what most people call me.  If you're more comfortable with "The OP" or
"Mr. Rifkin" that's fine.  I might object to derogatory epithets, but
that doesn't seem to be much of a problem in this forum.
From: Duane Rettig
Subject: Re: nonhomogenous structs (was: lisp performance questions and observations)
Date: 
Message-ID: <4sm81bcir.fsf@franz.com>
rif <···@mit.edu> writes:

> Duane Rettig <·····@franz.com> writes:
> 
> [many interesting comments deleted] 
> 
> > So you might read and understand this, and say "But I don't care about
> > bit fields; I care about the mixture of bytes, ints, longs, floats,
> > and double-floats".  And if this is the case (as I suspect it is), I
> > will leave it to you to guess how I might respond to that...
> 
> Upon the first couple readings of your post, I thought that the thrust
> of your argument is that there is a speed tradeoff in using less space
> --- as I start to use fields that need to be bitmasked and shifted to
> be extracted, I will pay a speed penalty as compared to using
> machine-word-sized pieces.  (I am certainly aware of this issue, as it
> has come up for me before in the form of "double-float floating-point
> operations are often much faster than single-float, in addition to
> being more precise.")

Yes, that's _part_ of it.

> However, I am clearly still missing something, as I don't think I
> understand the difference between C and asm in this context.  Perhaps
> you meant that in C, one might end up (naively?) using bit-field
> extraction operations on fields that were already properly aligned,
> which is therefore wasteful, and in assembly you'd be likely to notice
> this?  It seems unlikely to me that that's what you meant, but I'm not
> sure what you did mean.

No.  Actually, it is more important to understand the difference
between asm and "the machine level" (which is what I've been
emphasizing) than the differences between asm and C.  In this
context asm is not likely to know any more about the hardware than
C.

We programmers tend to view the instruction level of an architecture
as atomic.  This is reinforced by the limited hardware courses we
might take, where we might build some simple CPU which has a single
instruction decoder and which executes instructions one-per-cycle.
But in order to know what is really efficient, we must dive further
into the hardware than just to the asm level.

We do get a smattering of this - terms like "pipeline stalls" and
"locality of reference" make their way into our vocabulary, even
as programmers.  [Note that I include myself in this group, even
though my initial career had been in hardware - it is the _context_
of software which makes us ignore hardware concepts that might have
surprising effects on the software we write]. But the hardware field
continues to advance, and I notice some trends:

1. Data access is getting wider all the time.  You can still access very
small amounts of data; it just becomes less and less efficient in
relative terms, since wider-word accesses are being optimized more
than sub-word accesses.  DEC [sic] engineers recognized this when they
first designed the Alpha architecture, which didn't even _have_ sub-word
accessors like load-byte or store-byte instructions; you had to do
shift-and-mask operations to get byte values (this omission turned out
to be a mistake, and they created new sub-word accessors after all; even
though they were grossly inefficient, programmers rejected the lack of
byte accessors because it was what they were used to).

2. Caches are expanding.  When "virtual memory" (i.e. paging) was first
invented, the concept of "locality of reference" meant "not forcing a
page from amin memory out to swap disk".  In practical terms, it became
important to do as much work as possible on one page, so that all of
the work is likely to stay in main memory.  Nowadays, main memory is
much too slow; you want to do all of your work in "cache lines".  This
means that if you have an array-of-structs with three elements in each
inner struct, and you are looping on each array element and doing work
with one or two of the three elements, it would actually give _better_
locality of reference to rearrange this data structure into three
arrays with one individual element in each - the cache lines are exhausted
at a rate of 1/3 or 2/3 the frequency that they are when the inner
structs are grouped together.
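In Lisp terms, that rearrangement is just parallel specialized vectors
(a hypothetical three-field sketch):

```lisp
;; Instead of one vector of 3-slot structs, keep three specialized
;; vectors, one per field.  A loop that touches only field A then
;; walks a single dense vector, so every word pulled into a cache
;; line is useful data.
(defun make-fields (n)
  (values (make-array n :element-type 'fixnum :initial-element 0)   ; A
          (make-array n :element-type 'fixnum :initial-element 0)   ; B
          (make-array n :element-type 'fixnum :initial-element 0))) ; C

;; e.g. summing field A never drags B's and C's words into the cache:
;;   (loop for x across a sum x)
```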

3. Problem sets are being optimized for their vectorizability.  What
this means is that if you lay out your data in vectors of words,
vectors of bytes, vectors of floats, vectors of double-floats, then
a smarter-compiler (tm) might be able to vectorize operations on
these vectors.  If you examine one of the SSE/SSE2 architecture manuals,
you'll see loads of instructions that perform the same operations on
same-size-same-type operands, packed as many as possible into 128-bit
words at a time.  I don't know how many compilers can do this at this
time, but that is definitely the trend.

4. Vector concepts are being applied to instruction decode and execution.
This is one I've only recently learned, while doing optimizations on
our new AMD64 64-bit Allegro CL port; there is an optimization guide,
and one of the missives in it is that mixing different-size operations
(e.g., "movq rax,rbx" followed by "cmpb al,0") will cause pipeline
stalls and false dependencies (in this context the al/ax/eax/rax register
is viewed by the hardware as two different registers, even though they
share sets of bits).

The conclusion is that, in general, vectors of non-homogeneous structs
are becoming less and less efficient in a relative sense, which means
that homogeneous data is becoming more and more efficient on newer
hardware.

Analyzing this: Lisp never did non-homogeneous well; but C does
non-homogeneous in an increasingly pessimized way (unless the compiler
is able to reason about the aliasing and physically break up the
inner struct, in sort of a pseudo-strength-reduction rewrite -
I strongly doubt that this is the case, or that C as a language
would even allow such a restructuring).  In other words, whether
by design or by accident, Lisp anticipated the direction that
modern hardware would take!

> In my application, all the fields are actually integers.  (To be fair,
> I had not said this before.)

Yes; it is possible that you have a problem set that minimizes the
distinction I'm making.  However, I suspect (even as yet not knowing
what your requirements are) that there is still some gain to be had in
going ahead and breaking up those vectors-of-structs into vectors.

> It's fine to represent them as 32-bit
> ints (or as CMUCL fixnums, actually).  I'm not trying to squeeze every
> last bit out of each field, as I wouldn't want to pay the speed price
> for that in C.  The big problem with my CL version is not about
> bit-packing, it's about the fact that if what I want seems to be
> naturally represented as an array of structs, I'm going to pay (under
> CMUCL) an extra 3 machine words/object: one word because instead of
> laying the objects directly into the array, I have an array of
> pointers, and two more words because each struct has to have the
> header that says what type it is, even though I know in advance that
> they're all the same type.
> 
> (Note that in this application, because everything is actually an
> integer, I can represent the array of structs as a single big array of
> integers.  I can develop syntax that hides this abstraction from the
> algorithm, and this is what I'm likely to do.  If I had a mix of
> "individually well-aligned" types, things would be tougher --- I'd
> have to use parallel arrays to do it cleanly in CL, and my locality of
> reference would suck, even in the case where everything laid out
> nicely in C.)

I suspect that you will be surprised at the result of this exercise;
it may even get _faster_, depending on the percentage of the inner structs
you access in any particular loops.  If you have even one
loop that works on part of the cross-section of data at each index,
it would still probably be worth breaking the structs apart.

> So actually, I don't care about the mixture between bytes, longs,
> ints, and floats.  I care about the fact that I'm paying 3 extra
> machine words for each struct before I even get to the "data" for the
> struct itself.

OK, but in today's hardware, unless your data is 128 bits long, it
can still stand optimization by rearranging into vectors.  As you
obviously know (since you proposed the solution and rejected it for
other reasons) this only requires up to three extra words per
inner-struct-element-item which should get you much closer to your
goal of matching the C space requirement.
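
For reference, the "one big array" representation rif describes might
be sketched like this (a minimal, hypothetical sketch; the field names
and the field count are assumptions, not rif's actual code):

    ;; Each conceptual node occupies +fields-per-node+ consecutive
    ;; fixnums in one specialized array: no per-node headers, no
    ;; pointer indirection.
    (defconstant +fields-per-node+ 3)

    (defvar *nodes* (make-array (* 1000 +fields-per-node+)
                                :element-type 'fixnum
                                :initial-element 0))

    ;; Accessor macros hide the index arithmetic, so the algorithm
    ;; reads as if it were using ordinary structs.
    (defmacro node-parent (i) `(aref *nodes* (* ,i +fields-per-node+)))
    (defmacro node-child  (i) `(aref *nodes* (+ (* ,i +fields-per-node+) 1)))
    (defmacro node-depth  (i) `(aref *nodes* (+ (* ,i +fields-per-node+) 2)))

    ;; SETF works through the macros because they expand into AREF
    ;; places, e.g.: (setf (node-depth 42) 7)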

> I still don't know what "C-centric" means in the context of this
> conversation.  I suspect I misunderstood your point; I apologize in
> advance.

Hopefully by now you've figured out that C-centric means "arrays of
non-homogeneous structs" (note that the subject line, which I had changed
last iteration, provides the clue).  It is C-centric because, of the
three languages under consideration --- C, asm (which is agnostic), and
CL (which looks for word orientation) --- C is the only one which
considers packing non-homogeneous data into a struct and replicating
that concoction to be the most efficient way of doing things.  In
fact, it is becoming more and more machine-eccentric and pessimized.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: rif
Subject: Re: nonhomogenous structs (was: lisp performance questions and observations)
Date: 
Message-ID: <wj0d5yz38tk.fsf@geskekelud.mit.edu>
Duane Rettig <·····@franz.com> writes:

[many more interesting points deleted]

Thanks for all your advice and suggestions so far.  You make a lot of
interesting and good points.

I am willing to agree to what I believe is your main point: in
general, arrays of non-homogeneous structs are becoming less
efficient.  Your point about parallel arrays being *more* efficient in
the cases where there's at least some locality is well worth keeping
in mind, although it doesn't apply to my particular program/algorithm.

I think I have a better understanding of the tradeoffs now, and I
should read some more papers on different algorithms and write some
more code.

Thanks for being patient.

Cheers,

rif
From: Rahul Jain
Subject: Re: nonhomogenous structs
Date: 
Message-ID: <87k6t78t32.fsf@nyct.net>
rif <···@mit.edu> writes:

> I am willing to agree to what I believe is your main point: in
> general, arrays of non-homogeneous structs are becoming less
> efficient.

Also, in the type of array you want (where the struct is not referenced,
but is embedded in the array), the elements of the array are not the
structs. In fact, the struct is not a class and the elements of the
array are not even objects. You effectively have a 2D array where one
dimension is the index into your conceptual array and the other
dimension is the specific slot of that data structure.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Raymond Toy
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <sxdmzyasyho.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "rif" == rif  <···@mit.edu> writes:

    rif> For anyone who knows both systems, are CMUCL's aliens roughly the
    rif> equivalent of ACL's foreign types?

Don't know since I've never used ACL's foreign types.  I think they're
comparable.  But CMUCL's documentation isn't a secret.  You can find
it at

http://common-lisp.net/project/cmucl/doc/cmu-user/aliens.html#toc248 

and other nearby places.

Ray
From: Duane Rettig
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <4fz42hl6b.fsf@franz.com>
Raymond Toy <···········@ericsson.com> writes:

> >>>>> "rif" == rif  <···@mit.edu> writes:
> 
>     rif> For anyone who knows both systems, are CMUCL's aliens roughly the
>     rif> equivalent of ACL's foreign types?
> 
> Don't know since I've never used ACL's foreign types.  I think they're
> comparable.  But CMUCL's documentation isn't a secret.  You can find
> it at
> 
> http://common-lisp.net/project/cmucl/doc/cmu-user/aliens.html#toc248 
> 
> and other nearby places.

Well, I'm glad you've answered here, because I wouldn't want to make
any assumptions.  According to this documentation, CMUCL's alien types
seem similar to Allegro CL's foreign types when applied to the domain
of foreign calls.  However, the OP's programming style suggests that
he wants to be able to allocate, manipulate, and set values into and
get values from the various fields in one of these structures, without
necessarily using the structure as an argument to a foreign call, and
also (since the program is being written in Lisp) to be able to not
worry about deallocating the object.  This also implies that perhaps
an alien struct type might be able to be overlayed onto a lisp object
(preferably a specialized array), or, alternatively, that an alien
type object could be allocated in such a way that makes it
garbage-collectable.  Is either of these a possibility in CMUCL?  I
couldn't infer it from that documentation.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj01xfm4wer.fsf@five-percent-nation.mit.edu>
Duane Rettig <·····@franz.com> writes:

> Raymond Toy <···········@ericsson.com> writes:
> 
> > >>>>> "rif" == rif  <···@mit.edu> writes:
> > 
> >     rif> For anyone who knows both systems, are CMUCL's aliens roughly the
> >     rif> equivalent of ACL's foreign types?
> > 
> > Don't know since I've never used ACL's foreign types.  I think they're
> > comparable.  But CMUCL's documentation isn't a secret.  You can find
> > it at
> > 
> > http://common-lisp.net/project/cmucl/doc/cmu-user/aliens.html#toc248 
> > 
> > and other nearby places.
> 
> Well, I'm glad you've answered here, because I wouldn't want to make
> any assumptions.  According to this documentation, CMUCL's alien types
> seem similar to Allegro CL's foreign types when applied to the domain
> of foreign calls.  However, the OP's programming style suggests that
> he wants to be able to allocate, manipulate, and set values into and
> get values from the various fields in one of these structures, without
> necessarily using the structure as an argument to a foreign call, and
> also (since the program is being written in Lisp) to be able to not
> worry about deallocating the object.  This also implies that perhaps
> an alien struct type might be able to be overlayed onto a lisp object
> (preferably a specialized array), or, alternatively, that an alien
> type object could be allocated in such a way that makes it
> garbage-collectable.  Is either of these a possibility in CMUCL?  I
> couldn't infer it from that documentation.

In this case, I was already familiar with the documentation and use
CMUCL aliens.  What I *wasn't* familiar with was ACL foreign types; I
read the page briefly, but I didn't feel that asking whether they were
similar (to make sure I wasn't grossly misunderstanding) was out of
line.

CMUCL aliens do what I want conceptually.  I don't actually need to
have these arrays garbage collected.  Basically, the way I know of to
use the least memory for this problem involves allocating one big
array and just using it for the course of a problem. 

(Unfortunately, I only get 256 MB of malloc'able "C heap" with the
current CMUCL.  I don't know if this is easily user changeable; I
suspect it involves at least a recompile of CMUCL.)

Cheers,

rif
From: Raymond Toy
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <sxd7jped6lt.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "rif" == rif  <···@mit.edu> writes:

    rif> CMUCL aliens do what I want conceptually.  I don't actually need to
    rif> have these arrays garbage collected.  Basically, the way I know of to
    rif> use the least memory for this problem involves allocating one big
    rif> array and just using it for the course of a problem. 

    rif> (Unfortunately, I only get 256 MB of malloc'able "C heap" with the
    rif> current CMUCL.  I don't know if this is easily user changeable; I
    rif> suspect it involves at least a recompile of CMUCL.)

Unfortunately, that's true.  If you don't need much heap or other
spaces, I or someone else could help you create a custom version with
a larger C heap, if you want to try.

Ray
From: Jon Boone
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <m3pt36psrm.fsf@spiritus.delamancha.org>
rif <···@mit.edu> writes:

> (Unfortunately, I only get 256 MB of malloc'able "C heap" with the
> current CMUCL.  I don't know if this is easily user changeable; I
> suspect it involves at least a recompile of CMUCL.)

  Which version are you using?

--jon
From: Raymond Toy
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <sxdbreqd6om.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Duane" == Duane Rettig <·····@franz.com> writes:

    Duane> Well, I'm glad you've answered here, because I wouldn't want to make
    Duane> any assumptions.  According to this documentation, CMUCL's alien types
    Duane> seem similar to Allegro CL's foreign types when applied to the domain
    Duane> of foreign calls.  However, the OP's programming style suggests that
    Duane> he wants to be able to allocate, manipulate, and set values into and
    Duane> get values from the various fields in one of these structures, without
    Duane> necessarily using the structure as an argument to a foreign call, and
    Duane> also (since the program is being written in Lisp) to be able to not
    Duane> worry about deallocating the object.  This also implies that perhaps
    Duane> an alien struct type might be able to be overlayed onto a lisp object
    Duane> (preferably a specialized array), or, alternatively, that an alien
    Duane> type object could be allocated in such a way that makes it
    Duane> garbage-collectable.  Is either of these a possibility in CMUCL?  I

Not that I know of.  I think the only way to GC an alien object is to
put a finalizer of some sort around it.  

People have asked for such features, but no one has implemented
anything like this.  If it is a requirement, I just point them to
ACL. :-) Presumably LW and other commercial lisps provide similar
things.

Ray
From: Pascal Bourguignon
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87u0sjvquu.fsf@thalassa.informatimago.com>
rif <···@mit.edu> writes:

> I've been writing a suffix tree program in CL.  It's pretty fast, but
> the C reference program I'm comparing against (I have a binary only,
> not the source) is still a bit faster, and uses much less memory,
> which is important for suffix tree algorithms, because you want to run
> on as large strings as possible.
> 
> I'm using CMUCL.

I'd try gcl, ecl or ThinLisp.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Voting Democrat or Republican is like choosing a cabin in the Titanic.
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0mzyb5wsv.fsf@five-percent-nation.mit.edu>
> I'd try gcl, ecl or ThinLisp.

I have taken a brief look at each of these.  My initial impression is
that they do not actually solve the issues I have.  Specifically, I
would want to declare a structure type, and then lay out an array of
that structure type directly in memory.  (In standard CL, I get an
"array of pointers", leading to an extra word/struct and an extra
indirection on every access.)  Do you know (or have high confidence)
that any of these actually let me do this?  If not, I'm not sure how
they help me.

Cheers,

rif
From: Pascal Bourguignon
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87breqvrom.fsf@thalassa.informatimago.com>
rif <···@mit.edu> writes:

> > I'd try gcl, ecl or ThinLisp.
> 
> I have taken a brief look at each of these.  My initial impression is
> that they do not actually solve the issues I have.  Specifically, I
> would want to declare a structure type, and then lay out an array of
> that structure type directly in memory.  (In standard CL, I get an
> "array of pointers", leading to an extra word/struct and an extra
> indirection on every access.)  Do you know (or have high confidence)
> that any of these actually let me do this?  If not, I'm not sure how
> they help me.

There's also http://wcl.kontiki.com/ The paper doesn't mention how it
implements typed arrays though.


In any case, I don't see that Common-Lisp imposes such an
implementation of typed arrays.

    (defstruct rec field1 field2)
    (make-array '(20) :element-type 'rec)

could be all you have to write to get what you want.  If no
implementation does the "Right Thing", you may still modify one that
gives you the freedom to do it.



Or perhaps the difficulty lies in the specification of an
initialization for random types.  I think that it's implementation
dependent anyway, so if you implement typed arrays for structures, you
get to decide how to initialize them when no :initial-element or
:initial-contents is given.  Note that make-rec would return a newly
allocated structure.  At least with objects you could just use
initialize-instance, but beware the change of class! (There are
reasons why most implementations don't do it...)

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Voting Democrat or Republican is like choosing a cabin in the Titanic.
From: Rahul Jain
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87lldvbhfe.fsf@nyct.net>
rif <···@mit.edu> writes:

> 1.  Using structs under CMUCL, there appears to be an overhead of 8
>     bytes/object, and everything is done in eight byte chunks ---
>     either a one field struct or a two field struct costs 16 bytes.
>     Given that the structs I need have an odd number of fields, I'm
>     paying an extra 12 bytes for every node in my tree as compared to
>     what I'd pay in C.

Are you sure you're not gaining in cache-line usage efficiency? You're
guaranteed that a single object can't cross cache-lines if you align
them. If not, you could be loading up twice as much memory as you need
to use.

> 2.  The lack of "homogeneous" arrays of non-primitive types is
>     painful.  It'd be really useful to be able to say "Here's a struct
>     declaration.  Give me an array of these structs, where they're
>     laid out 'in place'.  No extra indirection.

You might be able to do this in CMUCL with the proper majick, but I
don't know where to point you.

> Has anyone else encountered these difficulties when trying to optimize
> algorithms?  Is there anyway around them, in CMUCL or other CL
> implementations?  I've gotten to the point where I know how to write
> numerical matrix-based CL programs that are essentially as fast and
> memory efficient as I'd write in C/C++, but it seems that for
> "complicated data structures", like tree based algorithms, where I'd
> really expect CL to excel, I'm paying a pretty big memory overhead.

If your data structure really is complex, then you're probably doing
rather complex things with it. The fact that your app requires 1024MB of
RAM instead of 768MB is probably not that big of a deal in the grand
scheme of things. By the time you're done doing all the stuff needed to
get the size down to 768MB, 1024MB will cost as much as 768MB did when
you started. But don't tell anyone this, it's the big secret of the
software industry and we'd all be done for if anyone else found out.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0is8z5q5d.fsf@five-percent-nation.mit.edu>
> Are you sure you're not gaining in cache-line usage efficiency? You're
> guaranteed that a single object can't cross cache-lines if you align
> them. If not, you could be loading up twice as much memory as you need
> to use.

Maybe, but not too likely.  The C program I'm comparing against
(again, I have a binary, but not a source) is simultaneously slightly
faster and uses way less memory.  

> If your data structure really is complex, then you're probably doing
> rather complex things with it. The fact that your app requires 1024MB of
> RAM instead of 768MB is probably not that big of a deal in the grand
> scheme of things. By the time you're done doing all the stuff needed to
> get the size down to 768MB, 1024MB will cost as much as 768MB did when
> you started. But don't tell anyone this, it's the big secret of the
> software industry and we'd all be done for if anyone else found out.

This is the kind of glib viewpoint that I wish were true, but it's not
really true in this domain.  People are interested in using these
algorithms in genomics problems.  Genomes are long.  If your program
uses a third more memory, you can process strings that are 3/4 as long
as the C program.  Next year, when 1024 MB costs as much as 768 MB
today, you'll still only be able to process strings that are 3/4 as
long as the C program.  Because we are not all that close (in GB or
years) to having "enough" memory, being a memory hog in this domain is
problematic.

rif
From: Alan Shutko
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <877jpezt12.fsf@wesley.springies.com>
rif <···@mit.edu> writes:

> This is the kind of glib viewpoint that I wish were true, but it's not
> really true in this domain.  People are interested in using these
> algorithms in genomics problems.  Genomes are long.

It should be noted there are a lot of bioinformatics folk who don't
care about the size of things.  Look at all the code out there using
Perl.  (Using one loader, I found it took about 1GB of memory to load a
30MB sequence.)  Even most of the folks using C are working on
strings, not on a representation that's actually efficient.
Furthermore, right now most work is being done on sections of the
genome, not the whole thing.  

None of the folks I talked to at Cold Spring Harbor were concerned
about the working size of their code.  They were far more concerned
with how fast it could be developed, and then with how fast it could
run.

Sure, there are probably a couple researchers who care about the
working set of their programs.  But they can spend time designing an
efficient data representation, and I suspect that the resultant
representation would be pretty close to the same size in both Lisp and
C.

-- 
Alan Shutko <···@acm.org> - I am the rocks.
Kind of creepy in an unwholesome and twisty way..
From: Cameron MacKinnon
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <DMKdnTZMzKPtseDcRVn-rw@golden.net>
Alan Shutko wrote:
> Sure, there are probably a couple researchers who care about the
> working set of their programs.  But they can spend time designing an
> efficient data representation, and I suspect that the resultant
> representation would be pretty close to the same size in both Lisp and
> C.

I skimmed some papers about suffix trees last night (because I wanted to 
see what C-centric algorithms looked like :-)  ) and came across this 
interesting assertion: "Although the utility of suffix trees is well 
known, their usage is limited to small length datasets due to their 
space requirements --- the best in-memory implementations so far take up 
to 12.5 times the database size."

I don't know if that is still true, but if so, it suggests that using a 
whole byte to represent a two bit base isn't as wasteful as one might 
think. From my reading of this thread, Rif is the existence proof that 
Alan speculates on above.

I would find it educational to see what suffix trees look like in Lisp, 
so if the other posters would consider continuing to hash this out on 
the newsgroup, I'd appreciate it.

-- 
Cameron MacKinnon
Toronto, Canada
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0r7nm3e80.fsf@five-percent-nation.mit.edu>
> I would find it educational to see what suffix trees look like in
> Lisp, so if the other posters would consider continuing to hash this
> out on the newsgroup, I'd appreciate it.

I'm travelling today.  I'll post links to my implementation(s)
tomorrow.  It's about 150 lines of CL, uncommented but with reasonable
naming of variables and functions.

Note that I'm not *really* using suffix-trees; I have a colleague
who's very into them, and this is more of an "Is CL good for this sort
of problem?" exercise for me.  My colleague is very concerned about
memory usage, so I felt that was important, and I was exploring how
memory efficient one could reasonably make the CL version.

rif
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0acu9nt0x.fsf@five-percent-nation.mit.edu>
rif <···@mit.edu> writes:

> > I would find it educational to see what suffix trees look like in
> > Lisp, so if the other posters would consider continuing to hash this
> > out on the newsgroup, I'd appreciate it.

As promised, I've posted my current implementations in
http://fpn.mit.edu/Downloads/SuffixTree

suffix-tree.lisp is the basic implementation.  suffix-tree-2.lisp is a
modified implementation in which the leaves are just fixnums rather
than being "nodes".  Both implementations use the McCreight algorithm.

The implementations are both O(n) in time and space and are therefore
"asymptotically efficient."  The constants are not as good as they
could be.  suffix-tree uses about 80 bytes/char, and suffix-tree-2
uses about 50 bytes/char (under CMUCL 19a).  I believe the best C
implementations use about 25 bytes/char.

Any comments on my code are appreciated; I'm always trying to improve
my skills.  The printing code in particular seems kludgy to me and
doesn't even do what I want (see the comment in suffix-tree.lisp), any
insights on this are most welcome.

Cheers,

rif
From: Marco Antoniotti
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <f3Pfd.17$TY6.9636@typhoon.nyu.edu>
rif wrote:
> rif <···@mit.edu> writes:
> 
> 
>>>I would find it educational to see what suffix trees look like in
>>>Lisp, so if the other posters would consider continuing to hash this
>>>out on the newsgroup, I'd appreciate it.
> 
> 
> As promised, I've posted my current implementations in
> http://fpn.mit.edu/Downloads/SuffixTree
> 
> suffix-tree.lisp is the basic implementation.  suffix-tree-2.lisp is a
> modified implementation in which the leaves are just fixnums rather
> than being "nodes".  Both implementations use the McCreight algorithm.
> 
> The implementations are both O(n) in time and space and are therefore
> "asymptotically efficient."  The constants are not as good as they
> could be.  suffix-tree uses about 80 bytes/char, and suffix-tree-2
> uses about 50 bytes/char (under CMUCL 19a).  I believe the best C
> implementations use about 25 bytes/char.

I just Googled for C implementations of suffix trees and found Tsadok's, 
Gusfield's, and Nelson's implementations (all C/C++).  On a cursory 
examination they do not seem to require a much different footprint per 
character than yours.  So I wouldn't be so worried.  The 
ultra-optimized versions most likely assume a very small alphabet, 
presumably for genomic applications.

I also bet that any bioinformatics shop that needs a suffix tree 
implementation off-the-shelf, just does what I did. 
Google+grab-the-first :)


> 
> Any comments on my code are appreciated; I'm always trying to improve
> my skills.  The printing code in particular seems kludgy to me and
> doesn't even do what I want (see the comment in suffix-tree.lisp), any
> insights on this are most welcome.
> 

Looks good to me.

Cheers
--
Marco
From: Ross Lippert
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <mkj4qken7bb.fsf@MATHSTATION031.MIT.EDU>
>I just Googled for C implementations of suffix trees and found
>Tsadok's, Gusfield's, and Nelson's implementations (all C/C++).  On a
>cursory examination they do not seem to require much a different
>footprint per character than yours.  So I wouldn't be so worried.

Just to review, for those who tuned in late.  The memory requirement
for a suffix tree comes down to
  n*sizeof(leaf)+m*sizeof(node)+n*sizeof(edge)
where m < n, but can always be made = n-1 by some simple examples, so we
may as well consider the footprint of an stree to be
  sizeof(leaf)+sizeof(node)+sizeof(edge)
per character.

In strmat, a STREE_NODE is 6 pointers/integers and a STREE_TLEAF is 4,
for a total of 10, that's 40 bytes per character.  There is no edge
type (that information is stored in the node and leaf structures).

Nelson's implementation takes 3 ints per leaf, and 4 per edge and 2
per node (he treats leaves and internal nodes as nodes), for a total
of 36 bytes per char.  Of course, Nelson's hashtable implementation is
poor enough that in order to even realize that footprint you need to
be willing to put up with some pretty bad timing.

Tsadok's does 8 ints per node, and 4 per leaf, and thus 48 bytes per
character (no explicit edges).

Finally, there is libstree, which puts all the above to shame for
memory use, with what seems to be over 50 bytes per char (though I
admit I cannot understand their code very well).

There exists a binary-executable implementation called REPuter which
uses 20 bytes per character.

This memory analysis assumes 32 bit pointers.  On 64-bit machines the
requirements for these implementations go up substantially even if you
only have 4 Gigs of addressable memory.

If there were time, it would be nice to see how fast each one of these
implementations were.

Should rif be preoccupied with memory use?  Only if he wanted to
distribute it as a library and not be laughed at.  If he is just doing
this recreationally, then perhaps not.  My motivation for making this
reply is partly because I have always been curious as to what the
footprints of these implementations really were, and this was a good
excuse to find out.

However, I do not think the issue is "how good a suffix tree
implementation can rif get in CL?"  I think this thread started with
"homogenous arrays afford me some advantages that I miss and can't
seem to get back."

Pardon me, rif, for paraphrasing.


-r

-- 
Ross A. Lippert
M.I.T., Department of Mathematics
Building 2, Room 335                    Voice (617) 253-7905
77 Massachusetts Avenue                 FAX (617) 253-4358
Cambridge, MA 02139-4307                e-mail:  ·······@math.mit.edu
From: Jakub Travnik
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <pan.2004.10.25.10.54.39.332893@sh.cvut.cz.no.sp.am>
On Mon, 25 Oct 2004 01:21:34 -0400, rif wrote:

> This is the kind of glib viewpoint that I wish were true, but it's not
> really true in this domain.  People are interested in using these
> algorithms in genomics problems.  Genomes are long.  If your program
> uses a third more memory, you can process strings that are 3/4 as long
> as the C program.  Next year, when 1024 MB costs as much as 768 MB
> today, you'll still only be able to process strings that are 3/4 as
> long as the C program.  Because we are not all that close (in GB or
> years) to having "enough" memory, being a memory hog in this domain is
> problematic.

So you are using CMUCL for representing genes. Which representation do you
use? Do you know that CMUCL offers very compact arrays (as opposed to most
C implementations)?

Look at this session (version shows: CMU Common Lisp CVS release-19a
19a-release-20040728 + minimal debian patches):

* (room)

Dynamic Space Usage:           88,464 bytes (out of  512 MB).
Read-Only Space Usage:     20,439,376 bytes (out of  256 MB).
Static Space Usage:         3,243,136 bytes (out of  256 MB).
Control Stack Usage:              548 bytes (out of  128 MB).
Binding Stack Usage:               96 bytes (out of  128 MB).
The current dynamic space is 0.
Garbage collection is currently enabled.

Breakdown for dynamic space:
         34,640 bytes for       848 simple-string-type objects.
         27,176 bytes for     3,397 cons objects.
         14,520 bytes for       313 instance objects.
          6,680 bytes for        31 simple-array-unsigned-byte-32-type objects.
          5,392 bytes for       667 bignum objects.
         13,992 bytes for     1,054 other objects.
        102,400 bytes for     6,310 dynamic objects (space total.)
* (defvar bases (make-array 1000000
                   :element-type '(integer 0 3) :initial-element 0))

BASES
* (room)

Dynamic Space Usage:          463,736 bytes (out of  512 MB).
Read-Only Space Usage:     20,439,376 bytes (out of  256 MB).
Static Space Usage:         3,243,136 bytes (out of  256 MB).
Control Stack Usage:              548 bytes (out of  128 MB).
Binding Stack Usage:               96 bytes (out of  128 MB).
The current dynamic space is 0.
Garbage collection is currently enabled.

Breakdown for dynamic space:
        250,008 bytes for         1 simple-array-unsigned-byte-2-type object.
         51,952 bytes for     1,066 simple-string-type objects.
         50,816 bytes for     6,352 sap objects.
         37,936 bytes for     4,742 cons objects.
         29,256 bytes for       596 instance objects.
         59,264 bytes for     3,488 other objects.
        479,232 bytes for    16,245 dynamic objects (space total.)


So 1000000 integers, all in the range 0 to 3, took 250008 bytes.  CMUCL
inferred that such integers fit into two bits and constructed a
space-efficient array accordingly.  Can your C compiler do this too?
(Without manual bit tweaking.)

It is rarely needed, but it may be useful for you.
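
For the curious, standard CL provides a way to query this inference
directly.  The return values shown are what the session above implies
for CMUCL 19a; they are implementation-dependent and may differ
elsewhere:

    ;; Ask which specialized representation the implementation will
    ;; actually use for a given element type:
    (upgraded-array-element-type '(integer 0 3))   ; => (UNSIGNED-BYTE 2) under CMUCL
    (upgraded-array-element-type '(integer 0 200)) ; => (UNSIGNED-BYTE 8) under CMUCL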

-- 
"Tempus neminem manet."

Jakub Travnik | ·············@jabber.com | ICQ: 66770334
GnuPG key: http://jtra.sh.cvut.cz/jtra_key.asc
From: Rahul Jain
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87sm80kbyd.fsf@nyct.net>
Jakub Travnik <j.travnik @ sh.cvut.cz.no.sp.am> writes:

>         250,008 bytes for         1 simple-array-unsigned-byte-2-type object.

I guess that answers my question.  I knew there were a bunch of sizes
supported, but wasn't sure which ones.  Turns out it supports packed
arrays of 1-, 2-, 4-, 8-, 16-, and 32-bit values: all the ones you'd
expect from a 32-bit Lisp.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Rahul Jain
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <87wtxckc6p.fsf@nyct.net>
rif <···@mit.edu> writes:

> This is the kind of glib viewpoint that I wish were true, but it's not
> really true in this domain.

Ok, but it's true in many domains. Especially for end-user application
programming.

> People are interested in using these algorithms in genomics problems. 
> Genomes are long.

Are you creating a new struct for each base pair or something? You
should be using (unsigned-byte 8) specialized arrays at worst for your
genome. I figure CMUCL could be hacked without too much trouble to
support (unsigned-byte 2) specialized arrays.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: rif
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <wj0zn28thus.fsf@five-percent-nation.mit.edu>
Rahul Jain <·····@nyct.net> writes:

> rif <···@mit.edu> writes:
> 
> > This is the kind of glib viewpoint that I wish were true, but it's not
> > really true in this domain.
> 
> Ok, but it's true in many domains. Especially for end-user application
> programming.

Agreed.

> > People are interested in using these algorithms in genomics problems. 
> > Genomes are long.
> 
> Are you creating a new struct for each base pair or something? You
> should be using (unsigned-byte 8) specialized arrays at worst for your
> genome. I figure CMUCL could be hacked without too much trouble to
> support (unsigned-byte 2) specialized arrays.

I am representing the string itself (i.e., the "genome") as a CL
string, which is costing one (unsigned-byte 8) per character.  This is
a tiny fraction of the total space used.  I am building a complex data
structure (a suffix tree) with one leaf per character and
approximately one internal node per character.  The internal nodes are
structs, because they have to store a bunch of information, and that
is where all the space is going --- see the rest of the thread for
more details.

Cheers,

rif
From: Raymond Toy
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <sxdy8hs9tmm.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Rahul" == Rahul Jain <·····@nyct.net> writes:

    Rahul> Are you creating a new struct for each base pair or something? You
    Rahul> should be using (unsigned-byte 8) specialized arrays at worst for your
    Rahul> genome. I figure CMUCL could be hacked without too much trouble to
    Rahul> support (unsigned-byte 2) specialized arrays.

With cmucl:

(upgraded-array-element-type '(unsigned-byte 2))
==> (UNSIGNED-BYTE 2)

Ray
From: Marco Antoniotti
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <q6Pfd.18$TY6.9981@typhoon.nyu.edu>
Rahul Jain wrote:

> rif <···@mit.edu> writes:
> 
> 
>>This is the kind of glib viewpoint that I wish were true, but it's not
>>really true in this domain.
> 
> 
> Ok, but it's true in many domains. Especially for end-user application
> programming.
> 
> 
>>People are interested in using these algorithms in genomics problems. 
>>Genomes are long.
> 
> 
> Are you creating a new struct for each base pair or something? You
> should be using (unsigned-byte 8) specialized arrays at worst for your
> genome. I figure CMUCL could be hacked without too much trouble to
> support (unsigned-byte 2) specialized arrays.
> 

Yes.  But I wager that you need at least (unsigned-byte 3), even for 
genomic applications.  It is not just ACGT.  If you look up the 
specification of the FASTA file format you will see why.  I do not 
remember it off the top of my head, but you have to account for codes 
for an otherwise unidentified purine or pyrimidine, codes for "not 
known", etc.

Cheers
--
Marco
From: Joe Marshall
Subject: Re: lisp performance questions and observations
Date: 
Message-ID: <mzyavgr7.fsf@ccs.neu.edu>
rif <···@mit.edu> writes:

> I've been writing a suffix tree program in CL.  

What algorithm?