Processing large files.

From: Jacek Generowicz
Subject: Processing large files.
Date: Tue, 03 Jul 2001 09:57:43 +0000
Message-ID: <g03d8el4h4.fsf@scumbag.ecs.soton.ac.uk>

Hi,

So far my lisp experience has been limited mostly to playing around
with snippets of the language, to familiarize myself with it. Now I am
faced with the opportunity to use it `in anger', but to perform a task
whose nature is very different from the sort of thing I have been
looking at from a lisp perspective . . . so I am hoping to find some
sound advice here.

I need to transform a large file of data (STEP). The components that
might need changing are cartesian points, which are represented thus:

#1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
-1.1875496989536942E+0) ) ;
#2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
7.547919664414555E+0, -1.3981092515064347E+0) ) ;

(Although I have wrapped the lines here, normally one point would be
represented on exactly one line.)

Not all lines in the file describe points; these should be left
unaltered.

All points defined in the file should be checked for proximity to any
of the other points, and may then be moved; the final result should be
a file identical in structure to the original, but with the
coordinates of some of the points slightly adjusted.


I would welcome suggestions on how to set about this. Particularly: 

What sort of data structure is appropriate for this task?  
How should it be filled ?
Any lisp features that are pertinent, which a lisp neophyte like
myself is unlikely think of.


Thanks,

Jacek

Re: Processing large files. Raymond Laning
- Re: Processing large files. Peter Van Eynde
Re: Processing large files. Jochen Schmidt
- Re: Processing large files. Jacek Generowicz
  - Re: Processing large files. Jochen Schmidt
    - Re: Processing large files. Al Christians
Re: Processing large files. Gareth McCaughan
- Re: Processing large files. Jacek Generowicz
  - Re: Processing large files. Gareth McCaughan
Re: Processing large files. Andy
- Re: Processing large files. Jacek Generowicz
  - Re: Processing large files. Arseny Slobodjuck
Re: Processing large files. ···············@solibri.com
- Re: Processing large files. Jacek Generowicz
  - Re: Processing large files. Raymond Wiker
  - Re: Processing large files. Gabe Garza
    - Re: Processing large files. ···············@solibri.com
    - Re: Processing large files. Coby Beck
  - Re: Processing large files. ···············@solibri.com

From: Raymond Laning
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 15:19:14 +0000
Message-ID: <3B41E272.29AB7DF5@west.raytheon.com>

I solved a similar problem of having to read large files of numeric
control points (gcode), project them onto a complex three-space surface,
and then output the new gcode files (> 100,000 points).  I used Allegro
Common Lisp from Franz with extremely good results (< 1 minute to
process 6 MByte files).

Some tips (for this particular app):

Use vectors or arrays to hold points, and, once you have debugged your
code and gotten the basic algorithm correct, go through and declare the
type of your variables and arrays.  If you need help on how to do that,
this is a good group for advice; there are many subleties to doing this
correctly.

Re-use large arrays, i.e., do your own memory management.  Creating
arrays and then throwing them away generates huge amounts of garbage
which takes time to clean up.  You can use a global variable to hold a
reusable array;  generally, use of globals is discouraged, but here it
is essential.

You may need a btree to efficiently partition your points into adjacent
spaces for manipulation.

HTH,

Raymond Laning

Jacek Generowicz wrote:
> 
> Hi,
> 
> So far my lisp experience has been limited mostly to playing around
> with snippets of the language, to familiarize myself with it. Now I am
> faced with the opportunity to use it `in anger', but to perform a task
> whose nature is very different from the sort of thing I have been
> looking at from a lisp perspective . . . so I am hoping to find some
> sound advice here.
> 
> I need to transform a large file of data (STEP). The components that
> might need changing are cartesian points, which are represented thus:
> 
> #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> -1.1875496989536942E+0) ) ;
> #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
> 
> (Although I have wrapped the lines here, normally one point would be
> represented on exactly one line.)
> 
> Not all lines in the file describe points; these should be left
> unaltered.
> 
> All points defined in the file should be checked for proximity to any
> of the other points, and may then be moved; the final result should be
> a file identical in structure to the original, but with the
> coordinates of some of the points slightly adjusted.
> 
> I would welcome suggestions on how to set about this. Particularly:
> 
> What sort of data structure is appropriate for this task?
> How should it be filled ?
> Any lisp features that are pertinent, which a lisp neophyte like
> myself is unlikely think of.
> 
> Thanks,
> 
> Jacek

From: Peter Van Eynde
Subject: Re: Processing large files.
Date: Wed, 04 Jul 2001 08:50:22 +0000
Message-ID: <86ith9gn1t.fsf@mustyr.fitit.be>

Raymond Laning <········@west.raytheon.com> writes:

> Use vectors or arrays to hold points, and, once you have debugged your
> code and gotten the basic algorithm correct, go through and declare the
> type of your variables and arrays.  If you need help on how to do that,
> this is a good group for advice; there are many subleties to doing this
> correctly.

Actually on CMUCL it is better to declare as you code. Some things have
'known' limits, even making new types like distance or refraction-index.
In this way if you ever do an mistake and give the wrong thing as a
parameter you'll get an error.

The compiler notes on optimization will help too :-)

> Re-use large arrays, i.e., do your own memory management.  Creating
> arrays and then throwing them away generates huge amounts of garbage
> which takes time to clean up.  You can use a global variable to hold a
> reusable array;  generally, use of globals is discouraged, but here it
> is essential.

*nod* Also be carefull of using format to print numbers: most of the
time it is faster to make a hash-table and to store them beforehand
and in the loop just do a lookup and print.

Groetjes, Peter

-- 
It's logic Jim, but not as we know it. | ········@debian.org
"God, root, what is difference?" - Pitr|
"God is more forgiving." - Dave Aronson| http://cvs2.cons.org/~pvaneynd/

From: Jochen Schmidt
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 10:43:31 +0000
Message-ID: <9hs795$etefa$1@ID-22205.news.dfncis.de>

Jacek Generowicz wrote:

> Hi,
> 
> So far my lisp experience has been limited mostly to playing around
> with snippets of the language, to familiarize myself with it. Now I am
> faced with the opportunity to use it `in anger', but to perform a task
> whose nature is very different from the sort of thing I have been
> looking at from a lisp perspective . . . so I am hoping to find some
> sound advice here.
> 
> I need to transform a large file of data (STEP). The components that
> might need changing are cartesian points, which are represented thus:

Siemens' STEP?

> 
> #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> -1.1875496989536942E+0) ) ;
> #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
> 
> (Although I have wrapped the lines here, normally one point would be
> represented on exactly one line.)
> 
> Not all lines in the file describe points; these should be left
> unaltered.
> 
> All points defined in the file should be checked for proximity to any
> of the other points, and may then be moved; the final result should be
> a file identical in structure to the original, but with the
> coordinates of some of the points slightly adjusted.

On the first glance this sounds a bit like thermodynamic.
I don't know how many points you have within this file but if there are many
then this task is maybe very computionally intensive.
I think local optimization techniques like Simulated Annealing/Treshold 
Accept might be applicable here.

ciao,
Jochen Schmidt

From: Jacek Generowicz
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 11:23:30 +0000
Message-ID: <g0n16mtfwt.fsf@scumbag.ecs.soton.ac.uk>

Jochen Schmidt <···@dataheaven.de> writes:

> > I need to transform a large file of data (STEP). The components that
> > might need changing are cartesian points, which are represented thus:
> 
> Siemens' STEP?

STandard for the Exchange of Product model data, I believe.

> > All points defined in the file should be checked for proximity to any
> > of the other points, and may then be moved; the final result should be
> > a file identical in structure to the original, but with the
> > coordinates of some of the points slightly adjusted.
> 
> On the first glance this sounds a bit like thermodynamic.

Nothing that interesting, I fear. It's meant to be a quick-and-dirty
method of snapping together objects which share boundaries, but whose
representations are slightly inaccurate.

> I don't know how many points you have within this file

About 20 thousand, in the end, is my guess.

> but if there are many then this task is maybe very computionally
> intensive.

Absolutely, but still trivial compared to the cost of a subsequent
step in the process, so I'm not too bothered about optimizing this at
all. (Well, there are some obvious ways of improving on the O(n^2)
behaviour)

> I think local optimization techniques like Simulated Annealing/Treshold 
> Accept might be applicable here.

At the moment I am more interested in learning about lisp
technicalities. This is meant to be a quick and-dirty-solution to a
small glitch in a chain of events. I have a good excuse for trying to
use lisp to solve it, and I thought I'd try it in order to broaden my
lisp horizons . . . so I would like to be pointed in the direction of
appropriate features of the language, so that I am not completely
wasting my time.

From: Jochen Schmidt
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 12:14:44 +0000
Message-ID: <9hsck8$fei4s$1@ID-22205.news.dfncis.de>

Jacek Generowicz wrote:

> Jochen Schmidt <···@dataheaven.de> writes:
> 
>> > I need to transform a large file of data (STEP). The components that
>> > might need changing are cartesian points, which are represented thus:
>> 
>> Siemens' STEP?
> 
> STandard for the Exchange of Product model data, I believe.
> 
>> > All points defined in the file should be checked for proximity to any
>> > of the other points, and may then be moved; the final result should be
>> > a file identical in structure to the original, but with the
>> > coordinates of some of the points slightly adjusted.
>> 
>> On the first glance this sounds a bit like thermodynamic.
> 
> Nothing that interesting, I fear. It's meant to be a quick-and-dirty
> method of snapping together objects which share boundaries, but whose
> representations are slightly inaccurate.

Ah - I thought you meant it the other way round - enlarging the space 
between the points... (like needed on cirquits)

>> I don't know how many points you have within this file
> 
> About 20 thousand, in the end, is my guess.

That is enough to be computionally intensive I think ;-)

>> but if there are many then this task is maybe very computionally
>> intensive.
> 
> Absolutely, but still trivial compared to the cost of a subsequent
> step in the process, so I'm not too bothered about optimizing this at
> all. (Well, there are some obvious ways of improving on the O(n^2)
> behaviour)

Yes

> 
>> I think local optimization techniques like Simulated Annealing/Treshold
>> Accept might be applicable here.
> 
> At the moment I am more interested in learning about lisp
> technicalities. This is meant to be a quick and-dirty-solution to a
> small glitch in a chain of events. I have a good excuse for trying to
> use lisp to solve it, and I thought I'd try it in order to broaden my
> lisp horizons . . . so I would like to be pointed in the direction of
> appropriate features of the language, so that I am not completely
> wasting my time.

Hm... I don't think that Lisp shines particularily in this problem-space 
but you can certainly gain from the general good abstraction facilities of 
it.

If I understand your problem right you will need some kind of sparse 
representation for the data (as a 2-dimensional array of distances for 20000
points would exceed the memory of most general-purpose (32 bit) PCs)

ciao,
Jochen

From: Al Christians
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 17:39:46 +0000
Message-ID: <3B420362.42E00044@PublicPropertySoftware.com>

R-Trees are supposed to be a good data structure for 2-D
data that somehow puts points that are close in the 2-D
sense also close together in the data structure.

Al

Jochen Schmidt wrote:
> 
> Jacek Generowicz wrote:
> 
> > Jochen Schmidt <···@dataheaven.de> writes:
> >
> >> > I need to transform a large file of data (STEP). The components that
> >> > might need changing are cartesian points, which are represented thus:
> >>
> >> Siemens' STEP?
> >
> > STandard for the Exchange of Product model data, I believe.
> >
> >> > All points defined in the file should be checked for proximity to any
> >> > of the other points, and may then be moved; the final result should be
> >> > a file identical in structure to the original, but with the
> >> > coordinates of some of the points slightly adjusted.
> >>
> >> On the first glance this sounds a bit like thermodynamic.
> >
> > Nothing that interesting, I fear. It's meant to be a quick-and-dirty
> > method of snapping together objects which share boundaries, but whose
> > representations are slightly inaccurate.
> 
> Ah - I thought you meant it the other way round - enlarging the space
> between the points... (like needed on cirquits)
> 
> >> I don't know how many points you have within this file
> >
> > About 20 thousand, in the end, is my guess.
> 
> That is enough to be computionally intensive I think ;-)
> 
> >> but if there are many then this task is maybe very computionally
> >> intensive.
> >
> > Absolutely, but still trivial compared to the cost of a subsequent
> > step in the process, so I'm not too bothered about optimizing this at
> > all. (Well, there are some obvious ways of improving on the O(n^2)
> > behaviour)
> 
> Yes
> 
> >
> >> I think local optimization techniques like Simulated Annealing/Treshold
> >> Accept might be applicable here.
> >
> > At the moment I am more interested in learning about lisp
> > technicalities. This is meant to be a quick and-dirty-solution to a
> > small glitch in a chain of events. I have a good excuse for trying to
> > use lisp to solve it, and I thought I'd try it in order to broaden my
> > lisp horizons . . . so I would like to be pointed in the direction of
> > appropriate features of the language, so that I am not completely
> > wasting my time.
> 
> Hm... I don't think that Lisp shines particularily in this problem-space
> but you can certainly gain from the general good abstraction facilities of
> it.
> 
> If I understand your problem right you will need some kind of sparse
> representation for the data (as a 2-dimensional array of distances for 20000
> points would exceed the memory of most general-purpose (32 bit) PCs)
> 
> ciao,
> Jochen

From: Gareth McCaughan
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 13:12:37 +0000
Message-ID: <slrn9k3h65.230r.Gareth.McCaughan@g.local>

Jacek Generowicz wrote:

> So far my lisp experience has been limited mostly to playing around
> with snippets of the language, to familiarize myself with it. Now I am
> faced with the opportunity to use it `in anger', but to perform a task
> whose nature is very different from the sort of thing I have been
> looking at from a lisp perspective . . . so I am hoping to find some
> sound advice here.
> 
> I need to transform a large file of data (STEP). The components that
> might need changing are cartesian points, which are represented thus:
> 
> #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> -1.1875496989536942E+0) ) ;
> #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
...
> All points defined in the file should be checked for proximity to any
> of the other points, and may then be moved; the final result should be
> a file identical in structure to the original, but with the
> coordinates of some of the points slightly adjusted.
> 
> 
> I would welcome suggestions on how to set about this. Particularly: 
> 
> What sort of data structure is appropriate for this task?  
> How should it be filled ?
> Any lisp features that are pertinent, which a lisp neophyte like
> myself is unlikely think of.

(You haven't said what variety of Lisp you're using, nor
what implementation. It might make a difference; see later.
I'll assume you're using Common Lisp. If it's, e.g.,
AutoLisp, then all bets are off as far as I'm concerned;
I don't know anything much about what works well in
AutoLisp.)

You might want to consider preprocessing the file using
something like perl, to get it into a form Lisp enjoys
reading. Maybe turn each line into something like

    (1 "CRTPNT1" 1.9770351033236283E+1 7.39E+0 -1.1875496989536942E+0)

for a "point" line and surround other lines with "...".

You certainly *can* read the original file using Lisp, by the way,
but you'll need fewer lines of code in all if you do it the way
I suggest. (On the other hand, if you're going to be doing a lot
of this sort of thing, a better option might be to develop some
Lisp tools for processing files like this.)

If the file is, as you say, very large, then the main issue
will be checking proximities. The best way to do that depends
on the distribution of your points. Are they clustered?
Uniformly and independently distributed in some region?
or what?

If the points are fairly uniformly distributed, then I would
consider the following approach: divide all of space into
cubes of a carefully chosen size. Classify the points according
to which cube each one falls into. Once you've done that,
each point only needs to be checked against the other points
in the same cube or one of its 26 neighbours. (You may be
able to avoid looking at some neighbours, by considering
where the point lies in the cube.)

Depending on how uniform the distribution of points is, you
could implement this with a 3-dimensional array or with a
hash table. Or, if the points are mostly uniformly distributed
but with occasional outliers, a hybrid strategy might work well.

Another possible approach: adaptive octree subdivision (or
something of the sort). You start by putting all the points
in a single bin. When any bin gets too large, you cut it in half
across the x, y or z axis. When you've processed all the points
you have a sort of binary subdivision of space, with each
region defined by it containing not too many points. The
difficult thing here is keeping track of which regions are
close enough to which others...

Or: calculate the Voronoi regions for the set of points,
remember that they're convex, and search outward from each
point's region.

Or (this one is nice and simple, in contrast to the last two,
but it might not work well): sort the points according to
one coordinate (say the x-coordinate), and then for each point
you only have to consider other points with x-coordinates
reasonably close to that point's x-coordinate. You will lose
if (say) all your points happen to lie on the plane x=1.2345.
For a variant on this, choose the coordinate to use more
carefully; or even choose a line to project onto in some
clever way that allows lines other than the axes.

I would guess that there are problems matching your description
for which each of these approaches is best (although I have
my doubts about the Voronoi one).

Lisp-specific issues:

As I said, you might want to preprocess the file to make the
input side easier. Output will not be a problem; Lisp is
pretty good at that.

Some Lisp implementations are not good at floating-point
stuff because they need to keep "boxing" and "unboxing"
the numbers. (Most implementations represent every object
as a 32-bit word which usually contains a pointer to
the object's data. If you're doing lots of work with
double-precision numbers, the machine can spend more
time chasing pointers and allocating new space for pointers
to point to than actually doing the calculations.)
It will probably help to attach type declarations to
lots of things. It may help to store your floating-point
numbers in specialized arrays. This is an area where
the implementation you use can make a big difference.
I *think* CMUCL is likely to be fastest, followed by
ACL, followed by LispWorks, followed by CLISP, in each
case by quite a wide margin. I don't know about MCL.

-- 
Gareth McCaughan  ················@pobox.com
.sig under construc

From: Jacek Generowicz
Subject: Re: Processing large files.
Date: Wed, 04 Jul 2001 09:16:48 +0000
Message-ID: <g0elrxf3zz.fsf@scumbag.ecs.soton.ac.uk>

················@pobox.com (Gareth McCaughan) writes:

> Jacek Generowicz wrote:

[a description of his file processing problem]

> (You haven't said what variety of Lisp you're using,

Common.

> nor what implementation.

ACL. The files in question are produced by ICAD which is built on ACL;
which is my excuse for trying this with lisp.

[The possibility of hacking the data BEFORE they are written to file,
has crossed my mind, but I suspect that I really do not want to go
there . . . unless anyone has done something like this before and it
wasn't too complicated.]

> You might want to consider preprocessing the file using
> something like perl, to get it into a form Lisp enjoys
> reading. Maybe turn each line into something like
> 
>     (1 "CRTPNT1" 1.9770351033236283E+1 7.39E+0 -1.1875496989536942E+0)
> 
> for a "point" line and surround other lines with "...".

"..." ?

> If the file is, as you say, very large, then the main issue
> will be checking proximities. The best way to do that depends
> on the distribution of your points. Are they clustered?
> Uniformly and independently distributed in some region?
> or what?

They describe a wing, with bits and bobs hanging off it.


[A number of interesting strategy outlines snipped]

> Lisp-specific issues:

[snip]

> Some Lisp implementations are not good at floating-point
> stuff because they need to keep "boxing" and "unboxing"
> the numbers. (Most implementations represent every object
> as a 32-bit word which usually contains a pointer to
> the object's data. If you're doing lots of work with
> double-precision numbers, the machine can spend more
> time chasing pointers and allocating new space for pointers
> to point to than actually doing the calculations.)

This is exactly the sort of area where I am afraid of doing the
completely wrong thing.

Thanks for your thoughtful input,

Jacek

From: Gareth McCaughan
Subject: Re: Processing large files.
Date: Sat, 07 Jul 2001 21:34:37 +0000
Message-ID: <slrn9kf03d.1863.Gareth.McCaughan@g.local>

Jacek Generowicz wrote:

[I said:]
> > You might want to consider preprocessing the file using
> > something like perl, to get it into a form Lisp enjoys
> > reading. Maybe turn each line into something like
> > 
> >     (1 "CRTPNT1" 1.9770351033236283E+1 7.39E+0 -1.1875496989536942E+0)
> > 
> > for a "point" line and surround other lines with "...".
> 
> "..." ?

Sorry, that wasn't very clear, was it? I should have said:
surround other lines with double-quotes. Then the reader can
just grab them verbatim. (Careful, though, about funny
characters within the line. It might actually be safer
to avoid using READ (or READ-FROM-STRING, or whatever)
at all on those lines.

> > If the file is, as you say, very large, then the main issue
> > will be checking proximities. The best way to do that depends
> > on the distribution of your points. Are they clustered?
> > Uniformly and independently distributed in some region?
> > or what?
> 
> They describe a wing, with bits and bobs hanging off it.

So I'd guess that your points will tend to come in
"1-dimensional" clusters and to lie on a smallish
number of surfaces, and (guessing ever more
wildly) that there might be some tendency towards
points sharing the same x-coordinate, say (where
the x-axis is "along the wing". I'm not sure what
implications this has for your choice of algorithm;
you probably want to avoid anything that relies
on points being uniformly distributed in 3-space.

-- 
Gareth McCaughan  ················@pobox.com
.sig under construc

From: Andy
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 13:20:38 +0000
Message-ID: <3B41C6A6.35FA3A35@none.com>

This probably won't go down well in this group, but I have always
used awk for this kind of problem.

The scripts are small and execute quickly.

Andy


Jacek Generowicz wrote:

> Hi,
>
> So far my lisp experience has been limited mostly to playing around
> with snippets of the language, to familiarize myself with it. Now I am
> faced with the opportunity to use it `in anger', but to perform a task
> whose nature is very different from the sort of thing I have been
> looking at from a lisp perspective . . . so I am hoping to find some
> sound advice here.
>
> I need to transform a large file of data (STEP). The components that
> might need changing are cartesian points, which are represented thus:
>
> #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> -1.1875496989536942E+0) ) ;
> #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
>
> (Although I have wrapped the lines here, normally one point would be
> represented on exactly one line.)
>
> Not all lines in the file describe points; these should be left
> unaltered.
>
> All points defined in the file should be checked for proximity to any
> of the other points, and may then be moved; the final result should be
> a file identical in structure to the original, but with the
> coordinates of some of the points slightly adjusted.
>
> I would welcome suggestions on how to set about this. Particularly:
>
> What sort of data structure is appropriate for this task?
> How should it be filled ?
> Any lisp features that are pertinent, which a lisp neophyte like
> myself is unlikely think of.
>
> Thanks,
>
> Jacek

From: Jacek Generowicz
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 14:59:35 +0000
Message-ID: <g0pubiawiw.fsf@scumbag.ecs.soton.ac.uk>

Andy <····@none.com> writes:

> This probably won't go down well in this group, but I have always
> used awk for this kind of problem.
> 
> The scripts are small and execute quickly.

While there is a lot to recommend using awk(*) in this instance, I
don't think it will help me learn much about lisp :-)

Jacek


(*) Indeed, this is the first language I thought of using.

From: Arseny Slobodjuck
Subject: Re: Processing large files.
Date: Wed, 04 Jul 2001 02:33:35 +0000
Message-ID: <3b4274df.7506954@212.16.193.13>

On 03 Jul 2001 15:59:35 +0100, Jacek Generowicz <···@ecs.soton.ac.uk>
wrote:

>> This probably won't go down well in this group, but I have always
>> used awk for this kind of problem.
>> 
>> The scripts are small and execute quickly.
>
>While there is a lot to recommend using awk(*) in this instance, I
>don't think it will help me learn much about lisp :-)
If you preprocess your file a little using awk or sed you can then
easily do READ(stream) in loop and get lisp objects without any
'manual' parsing.
You even can surround the data into parentheses and get long list
of data without a loop. 
Further, you may write your own macro character handler for more
complex data structures than that (see set-macro-character) . 

All that simplicity becomes possible with character-based
preprocessing. I think that using lisp for preprocessing makes no
sense while you have awk or sed even for learn purpose since, by my
small (2 year) experience, it is not a field where the lisp shines.

From: ···············@solibri.com
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 12:21:00 +0000
Message-ID: <usngexkyb.fsf@solibri.com>

Jacek Generowicz <···@ecs.soton.ac.uk> writes:

> I need to transform a large file of data (STEP). The components that
> might need changing are cartesian points, which are represented thus:
> 
> #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> -1.1875496989536942E+0) ) ;
> #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
> 
> (Although I have wrapped the lines here, normally one point would be
> represented on exactly one line.)
> 
> Not all lines in the file describe points; these should be left
> unaltered.
> 
> All points defined in the file should be checked for proximity to any
> of the other points, and may then be moved; the final result should be
> a file identical in structure to the original, but with the
> coordinates of some of the points slightly adjusted.
> 
> 
> I would welcome suggestions on how to set about this. Particularly: 
> 
> What sort of data structure is appropriate for this task?  
> How should it be filled ?
> Any lisp features that are pertinent, which a lisp neophyte like
> myself is unlikely think of.

I once wrote a parser for a system using the same file format. I found
CL quite a good fit for the problem. In your case seems you need a
two-pass system: first collect all the points, munge them as required,
then write out the file with the new points. Depending on the file
size you could do this in memory or reading and writing one line at
the time.
- A list could be quite sufficient for collecting the points (assuming
you don't try to do random access stuff on it; use an adjustable array if
you need that).
- The lisp reader can handle STEP files, you just need to change some
of the annoying characters(#',) to safer ones first (and skip the
comments, if any).
- You could do this even as a single-pass thing if you relaxed the
requirement on the identical file structure: as the lines have their internal
reference numbers (#2 etc.), they do not need to be in any given
order, so you could just write all the points to the end of the file
(before the end tags) instead of in their original place in the file. 

--

From: Jacek Generowicz
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 14:55:57 +0000
Message-ID: <g0sngeawoy.fsf@scumbag.ecs.soton.ac.uk>

···············@solibri.com writes:

> Jacek Generowicz <···@ecs.soton.ac.uk> writes:
> 
> > I need to transform a large file of data (STEP). The components that
> > might need changing are cartesian points, which are represented thus:
> > 
> > #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> > -1.1875496989536942E+0) ) ;
> > #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> > 7.547919664414555E+0, -1.3981092515064347E+0) ) ;

[snip]

> I once wrote a parser for a system using the same file format. I found
> CL quite a good fit for the problem. In your case seems you need a
> two-pass system: first collect all the points, munge them as required,
> then write out the file with the new points.

This was the broad plan.

> Depending on the file size you could do this in memory or reading
> and writing one line at the time.

Not sure how I could do it a line at a time, if the points need to
communicate with eachoter.

> - A list could be quite sufficient for collecting the points (assuming
> you don't try to do random access stuff on it; use an adjustable array if
> you need that).

All the obvious approaches I think of involve some sort of random
access.

How should I go about filling the array as I read the file ?

Naively I would try repeadtedly calling (setf (aref ... )), but I
suspect that there are much better ways.

> - The lisp reader can handle STEP files, you just need to change some
> of the annoying characters(#',) to safer ones first

Could you expand on this. I unerstand why [#`,] are `dangerous', but I'm
not sure how you propose I change them. By pre-processing the file
with something else ?

> - You could do this even as a single-pass thing if you relaxed the
> requirement on the identical file structure:

I don't really want to learn the details of STEP, unless I really have
to, so just adulterating some of the data seems safest.

> as the lines have their internal
> reference numbers (#2 etc.), they do not need to be in any given
> order, so you could just write all the points to the end of the file
> (before the end tags) instead of in their original place in the file. 

Could be worth considering . . .

#13251 = CARTESIAN_POINT ('CRTPNT13251', (0.0E+0, 0.0E+0, 0.0E+0) ) ;
#13252 = DIRECTION ('DRCTN13252', (0.0E+0, 0.0E+0, 1.0E+0) ) ;
#13253 = DIRECTION ('DRCTN13253', (1.0E+0, 0.0E+0, 0.0E+0) ) ;
#13254 = AXIS2_PLACEMENT_3D ('A2PL3D13254', #13251, #13252, #13253) ;
#13255 = CARTESIAN_POINT ('CRTPNT13255', (0.0E+0, 0.0E+0, 0.0E+0)) ;

. . . though it does seem that the point identifications are coupled
to the datum line number, so changing the order would require changing
the meta-data, which would complicate the task.

Cheers,

Jacek

From: Raymond Wiker
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 15:18:03 +0000
Message-ID: <8666dahwib.fsf@raw.grenland.fast.no>

Jacek Generowicz <···@ecs.soton.ac.uk> writes:

> ···············@solibri.com writes:
> 
> > Depending on the file size you could do this in memory or reading
> > and writing one line at the time.
> 
> Not sure how I could do it a line at a time, if the points need to
> communicate with eachoter.
> 
> > - A list could be quite sufficient for collecting the points (assuming
> > you don't try to do random access stuff on it; use an adjustable array if
> > you need that).
> 
> All the obvious approaches I think of involve some sort of random
> access.
> 
> How should I go about filling the array as I read the file ?
> 
> Naively I would try repeadtedly calling (setf (aref ... )), but I
> suspect that there are much better ways.

        R-Trees, maybe? These are similar to B-trees, but
n-dimensional. Once you have placed your data in an R-tree, you can
iterate across all the points, find all (other) points within a given
range, and modify them (you'll probably need an additional structure
to hold the updated values.)

        Note that you can do this without R-trees; the R-tree just
gives you an efficient mechanism for locating all points in the
vicinity of a given point.

        R-trees are reasonably simple to implement in common lisp, but
you should consider a "global" packing algorithm instead of simply
building the tree by inserting single elements. I've used an algorithm
called STR (Sort, Tile, Recurse), which gave much better search
performance. 

-- 
Raymond Wiker
·············@fast.no

From: Gabe Garza
Subject: Re: Processing large files.
Date: Wed, 04 Jul 2001 07:16:29 +0000
Message-ID: <94cc988a.0107032316.58f4c92a@posting.google.com>

Jacek Generowicz <···@ecs.soton.ac.uk> wrote in message news:<··············@scumbag.ecs.soton.ac.uk>...
> ···············@solibri.com writes:
> 
> > Jacek Generowicz <···@ecs.soton.ac.uk> writes:
> > 
> > > I need to transform a large file of data (STEP). The components that
> > > might need changing are cartesian points, which are represented thus:
> > > 
> > > #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> > > -1.1875496989536942E+0) ) ;
> > > #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> > > 7.547919664414555E+0, -1.3981092515064347E+0) ) ;

      I haven't been using Common Lisp for long so take everything I
say with a block of salt; hopefully someone with more experience will
offer advice/corrections on what I say. 
 

> > - A list could be quite sufficient for collecting the points (assuming
> > you don't try to do random access stuff on it; use an adjustable array if
> > you need that).
> 
> All the obvious approaches I think of involve some sort of random
> access.
> 
> How should I go about filling the array as I read the file ?
> 
> Naively I would try repeadtedly calling (setf (aref ... )), but I
> suspect that there are much better ways.

   The easiest would be to use an adjustable array with a fill
pointer:

   (make-array <rough estimate of size> :adjustable t :fill-pointer 0)

   then fill it in with VECTOR-PUSH-EXTEND
 
> > - The lisp reader can handle STEP files, you just need to change some
> > of the annoying characters(#',) to safer ones first
> 
> Could you expand on this. I unerstand why [#`,] are `dangerous', but I'm
> not sure how you propose I change them. By pre-processing the file
> with something else ?

Perhaps something like:

(declaim (special *STEP-readtable*))
(let ((new-readtable (copy-readtable)))
  (flet
      ((unfrob-chars (&rest chars) 
         (loop for char in chars do
	   (set-syntax-from-char char #\a new-readtable *readtable*))))
    (setf (readtable-case new-readtable) ':preserve)
    ;Use #\' to delimate strings
    (set-syntax-from-char #\' #\" new-readtable *readtable*)
    ;Erase commas
    (set-syntax-from-char #\, #\space new-readtable *readtable*)
    ;Strip the following of their special meaning
    (unfrob-chars ··@ #\" #\# #\; #\` #\\ #\|)
    (setf *STEP-readtable* new-readtable)))

(defun read-STEP-from-string (string &aux old-table)
  (setf old-table *readtable*)
  (unwind-protect
      (progn
	(setf *readtable* *STEP-readtable*)
	(let ((tokens '())
	      (pos 0))
	  (loop 
	    (multiple-value-bind (token n)
		(read-from-string string nil :eof :start pos)
	      (when (eq token :eof) (return (nreverse tokens)))
	      (setf pos n)
	      (push token tokens)))))
    (setf *readtable* old-table)))


Read a line from your file, hand it to read-STEP-from-string:

(read-STEP-from-string "#1 = CARTESIAN_POINT ('CRTPNT1',
(1.9770351033236283E+1, 7.39E+0, -1.1875496989536942E+0) ) ;")

==> 

(|#1| = CARTESIAN_POINT ("CRTPNT1" (19.770351 7.39 -1.1875497)) |;|)

Once you have a list like that, it ought to be fairly easy to convert
to a more comfortable data structure(s) (e.g., interning the name
and hashing it to a structure that holds information about the point,
using an array to index a number to a point, etc.)

If STEP files have comments, use set-syntax-from-char in the (LET #) 
form above to set the syntax of that character to the syntax of #\; 
(if they are LISP style line comments; if not hopefully you have an 
idea of what you can do...)

You may also want to make # a macro character such that #123 ==> 123 
(this assumes that you can figure out what a number represents from 
context, of not you'll have to something marginally more complicated).

If you want double-precision floating point, don't forget:

(setf *READ-DEFAULT-FLOAT-FORMAT* 'double-float)

Good luck!

Gabe Garza

From: ···············@solibri.com
Subject: Re: Processing large files.
Date: Wed, 04 Jul 2001 10:30:51 +0000
Message-ID: <ulmm5x9yc.fsf@solibri.com>

······@cory.eecs.berkeley.edu (Gabe Garza) writes:

< readtable -based conversion piece omitted >

> 
> 
> Read a line from your file, hand it to read-STEP-from-string:
> 
> (read-STEP-from-string "#1 = CARTESIAN_POINT ('CRTPNT1',
> (1.9770351033236283E+1, 7.39E+0, -1.1875496989536942E+0) ) ;")
> 
> ==> 
> 
> (|#1| = CARTESIAN_POINT ("CRTPNT1" (19.770351 7.39 -1.1875497)) |;|)

Some caveats: 
There may be varying amounts of whitespace in the data, for example either
#2=CARTESIAN... 
or
#2 = Cartesian...
or
#2= cartesian...
etc.

From a performance viewpoint it may be useful to first check that the
line contains a  CARTESIAN_POINT before parsing it any further,
assuming you are only changing the point data.

One problem with readtable -based character conversion is that you
might need a little more tricks to make sure you don't change the
characters inside quoted strings in the data (if it makes any
difference in your application).

> If STEP files have comments, use set-syntax-from-char in the (LET #) 
> form above to set the syntax of that character to the syntax of #\; 
> (if they are LISP style line comments; if not hopefully you have an 
> idea of what you can do...)

C-style /* ... */ comments. I don't remember if they can also extend
to multiple lines, see the standard.

From: Coby Beck
Subject: Re: Processing large files.
Date: Thu, 05 Jul 2001 13:28:29 +0000
Message-ID: <1YZ07.95817$_T2.24506713@typhoon.tampabay.rr.com>

"Gabe Garza" <······@cory.eecs.berkeley.edu> wrote in message
·································@posting.google.com...
> Jacek Generowicz <···@ecs.soton.ac.uk> wrote in message
news:<··············@scumbag.ecs.soton.ac.uk>...
> > ···············@solibri.com writes:
> >
> > > Jacek Generowicz <···@ecs.soton.ac.uk> writes:
> > >
> > > > I need to transform a large file of data (STEP). The components that
> > > > might need changing are cartesian points, which are represented
thus:
> > > >
> > > > #1 = CARTESIAN_POINT ('CRTPNT1', (1.9770351033236283E+1, 7.39E+0,
> > > > -1.1875496989536942E+0) ) ;
> > > > #2 = CARTESIAN_POINT ('CRTPNT2', (1.8984532085090486E+1,
> > > > 7.547919664414555E+0, -1.3981092515064347E+0) ) ;
>
>       I haven't been using Common Lisp for long so take everything I
> say with a block of salt; hopefully someone with more experience will
> offer advice/corrections on what I say.
>
[snip]

> (defun read-STEP-from-string (string &aux old-table)
>   (setf old-table *readtable*)

[snip]

>     (setf *readtable* old-table)))
>

Just FYI: as a special variable, you do not have to do the above for
*readtable*.  If you write:

(let ((*readtable* *step-readtable*))

  (...))

lisp will politely restore the previous binding for you!  Very civilized,
don't you think  : )

Coby

From: ···············@solibri.com
Subject: Re: Processing large files.
Date: Tue, 03 Jul 2001 22:50:39 +0000
Message-ID: <upubhy6dc.fsf@solibri.com>

Jacek Generowicz <···@ecs.soton.ac.uk> writes:

> ···············@solibri.com writes:

> > I once wrote a parser for a system using the same file format. I found
> > CL quite a good fit for the problem. In your case seems you need a
> > two-pass system: first collect all the points, munge them as required,
> > then write out the file with the new points.
> 
> This was the broad plan.
> 
> > Depending on the file size you could do this in memory or reading
> > and writing one line at the time.
> 
> Not sure how I could do it a line at a time, if the points need to
> communicate with eachoter.

That was the first pass: only to collect the points. Obviously, they
have all to be in some in-memory data structure, this one line at a
time only refers to reading/writing the unmodified stuff.

> > - A list could be quite sufficient for collecting the points (assuming
> > you don't try to do random access stuff on it; use an adjustable array if
> > you need that).

> How should I go about filling the array as I read the file ?
> 
> Naively I would try repeadtedly calling (setf (aref ... )), but I
> suspect that there are much better ways.

Not a bad way. Another way might be to collect the stuff into a list
by (push ...), and when you have got them all, make a vector of
it. Might be more effective than constantly adjusting array size.

> > - The lisp reader can handle STEP files, you just need to change some
> > of the annoying characters(#',) to safer ones first
> 
> Could you expand on this. I unerstand why [#`,] are `dangerous', but I'm
> not sure how you propose I change them. By pre-processing the file
> with something else ?

You can do that externally with normal Unix tools (tr, awk etc), or
just read the data first by read-line, so you have a string, and then
doctor the stuff in Lisp. Probably easier in Lisp, as you have to
parse it enough so that you won't be munging the characters inside the
' ' -quoted strings in the data.

> 
> > - You could do this even as a single-pass thing if you relaxed the
> > requirement on the identical file structure:
> 
> I don't really want to learn the details of STEP, unless I really have
> to, so just adulterating some of the data seems safest.
> 
> > as the lines have their internal
> > reference numbers (#2 etc.), they do not need to be in any given
> > order, so you could just write all the points to the end of the file
> > (before the end tags) instead of in their original place in the file. 
> 
> Could be worth considering . . .
> 
> #13251 = CARTESIAN_POINT ('CRTPNT13251', (0.0E+0, 0.0E+0, 0.0E+0) ) ;
> #13252 = DIRECTION ('DRCTN13252', (0.0E+0, 0.0E+0, 1.0E+0) ) ;
> #13253 = DIRECTION ('DRCTN13253', (1.0E+0, 0.0E+0, 0.0E+0) ) ;
> #13254 = AXIS2_PLACEMENT_3D ('A2PL3D13254', #13251, #13252, #13253) ;
> #13255 = CARTESIAN_POINT ('CRTPNT13255', (0.0E+0, 0.0E+0, 0.0E+0)) ;
> 
> . . . though it does seem that the point identifications are coupled
> to the datum line number, so changing the order would require changing
> the meta-data, which would complicate the task.

If you dug out the relevant standard (ISO 10303/21, IIRC), you'd find
out that the #13253 -style references are not line numbers: they may
be in any order (outside the header and end marker blocks), also
non-sequential, with missing numbers etc. So the lines should be safe 
to re-order. (Pretty funny writing that parser, too: forward
references are quite possible. Wound up looping through the unresolved
lines several times before getting them all...)

--