From: Alex Mizrahi
Subject: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <2le3nlFb82reU1@uni-berlin.de>
Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class implementation - possibly Python had some significant overhead for each
object instance, or something like this..

then i asdf-installed s-xml package and tried it with it. it ate only 25
megs for lxml representation. i think interning element names helped a lot..
it was CLISP that has unicode inside, so i think it could be even less
without unicode..

then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks like
interning helps a lot 8-]
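
A rough illustration of the interning effect guessed at above, as a Python sketch rather than anything measured on the original document: if every node shares one string object per distinct element name, each name costs a reference per node instead of a fresh string. The `make_nodes` helper and the tag names are made up for the example, and the object counts rely on CPython behaviour.

```python
import sys

def make_nodes(tag_names, intern_names):
    """Build (tag, children) tuples, optionally sharing the tag strings."""
    nodes = []
    for name in tag_names:
        # Rebuild the string at run time so each node starts with its own copy,
        # the way a parser would when slicing names out of its input buffer.
        tag = "".join(list(name))
        if intern_names:
            tag = sys.intern(tag)
        nodes.append((tag, []))
    return nodes

names = ["record", "field", "record", "field", "record"]
plain = make_nodes(names, intern_names=False)
shared = make_nodes(names, intern_names=True)

# Count the distinct string objects that hold the tag names.
distinct_plain = len({id(tag) for tag, _ in plain})
distinct_shared = len({id(tag) for tag, _ in shared})
print(distinct_plain, distinct_shared)
```

With around 200000 elements and only a handful of distinct names, the difference between the two counts is roughly the number of redundant name strings a non-interning parser keeps alive.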

then i tried Perl XML::DOM.. it was better than python - about 180megs, but
it was slowest.. at least it consumed mem slower than python 8-]

and java.. with default parser it took 45mbs.. maybe it interned strings,
but there was overhead from classes - storing trees is definitely what's
lisp optimized for 8-]

so lisp is winner.. but it has not standard way (even no non-standard but
simple) way to write binary IEEE floating point representation, so common
lisp suck and i will use c++ for my task.. 8-]]]

With best regards, Alex 'killer_storm' Mizrahi.

From: Peter Hansen
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <sr2dnedSLYPKaWzdRVn-sA@powergate.ca>
Alex Mizrahi wrote:

> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.

Often, problems with performance come down to using the
wrong algorithm, or using the wrong architecture for the
problem at hand.

Are you absolutely certain that using a full in-memory DOM
representation is the best for your problem?  It seems
very unlikely to me that it really is...

For example, there are approaches which can read in the
document incrementally (and I'm not just talking SAX here),
rather than read the whole thing at once.
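
One incremental approach of this kind, sketched with `xml.etree.ElementTree.iterparse` from current Python (an assumption on the editor's part, not necessarily the approach meant here): handle each finished subtree as it arrives and then discard its contents. The `count_elements` helper and the `item` tag are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, record_tag):
    """Stream through `source`, handling one `record_tag` subtree at a time."""
    seen = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            seen += 1
            # Drop the children and text of the subtree we just finished,
            # so the finished records do not pile up in memory.
            elem.clear()
    return seen

doc = b"<root><item>a</item><item>b</item><item>c</item></root>"
count = count_elements(io.BytesIO(doc), "item")
print(count)
```

The emptied elements stay attached to the root, so memory still grows slowly with the number of records; the point is only that no full tree is ever held.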

I found your analysis fairly simplistic, on the whole...

-Peter
From: Paul Rubin
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <7xllhpc3yx.fsf@ruckus.brouhaha.com>
Peter Hansen <·····@engcorp.com> writes:
> For example, there are approaches which can read in the
> document incrementally (and I'm not just talking SAX here),
> rather than read the whole thing at once.

Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.
From: Tim Bradshaw
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <fbc0f5d1.0407120244.188e6e7f@posting.google.com>
Paul Rubin <·············@NOSPAM.invalid> wrote in message news:<··············@ruckus.brouhaha.com>...

> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.

I dunno if this has been done recently, but this is the sort of thing
that people used to do for very large SGML documents.  I forget the
details, but I remember things that were some hundreds of MB (parsed
or unparsed I'm not sure) which would be written out in some parsed
form into large files, which could then be manipulated as if the whole
object was there.  Of course no one would care about a few hundred MB
of memory now, but they did then (this was 91-92 I think).

I had a theory of doing this all lazily, so you wouldn't have to do
the (slow) parsing step up-front but would just lie and say `OK, I
parsed it', then actually doing the work only on demand.

--tim
From: Paul Rubin
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <7xd6315amk.fsf@ruckus.brouhaha.com>
··········@tfeb.org (Tim Bradshaw) writes:
> I had a theory of doing this all lazily, so you wouldn't have to do
> the (slow) parsing step up-front but would just lie and say `OK, I
> parsed it', then actually doing the work only on demand.

1) I think you do need to do the parsing up front, since you have to
provide basically random access to the contents of the tree.

2) The parsing needn't be slow.  It should just make a very fast
linear scan over the document, emitting an in-memory record every time
it sees a tag that makes a new tree node.  It would remember on a
stack where it is in the tree structure as it scans, and whenever it
found a tag that closes off a subtree (e.g. </table>), it could emit
pointers to where that subtree's parent and previously seen sibling
are, and update the parent to know where the last-seen child is.
Finally, it could make another pass over the in-memory structure to
vectorize the list of children at each node.  This could be done with
even less memory by making two passes over the document (one to count
the children for each node and one to build the tree with a vector of
children at each node, allocated to the now-known size), at the
expense of some speed.  The point is, not much processing needs to be
done during the scan.  It should certainly be possible to parse a 3 MB
document in under a second (I assume we're talking about a C
extension).  It just baffles me why the libraries that are out there
are so much slower.
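
A minimal Python sketch of the scan described in 2), assuming expat does the tokenizing: each node is one record spread over parallel flat lists, a stack tracks the open elements, and a second pass turns the sibling chains into child vectors. For brevity the links are filled in when a tag opens rather than when it closes, which yields the same pointers. All names here (`scan`, `vectorize`, the list names) are illustrative only.

```python
import xml.parsers.expat

def scan(xml_bytes):
    """One linear pass: per-node records kept in parallel flat lists."""
    tags, parent, prev_sibling, last_child = [], [], [], []
    stack = []  # indices of the currently open elements

    def start(name, attrs):
        node = len(tags)
        up = stack[-1] if stack else -1
        tags.append(name)
        parent.append(up)
        # Point at the previously seen sibling, then become the last child.
        prev_sibling.append(last_child[up] if up >= 0 else -1)
        last_child.append(-1)
        if up >= 0:
            last_child[up] = node
        stack.append(node)

    def end(name):
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)
    return tags, parent, prev_sibling, last_child

def vectorize(prev_sibling, last_child):
    """Second pass over the records: sibling chains become child vectors."""
    children = []
    for node in range(len(last_child)):
        kids = []
        child = last_child[node]
        while child != -1:
            kids.append(child)
            child = prev_sibling[child]
        kids.reverse()  # the chain runs from the last child back to the first
        children.append(kids)
    return children

tags, parent, prev_sibling, last_child = scan(b"<a><b/><c><d/></c><e/></a>")
children = vectorize(prev_sibling, last_child)
print(tags, parent, children)
```

Text and attributes are ignored here; a fuller version could store byte offsets into the source instead of copying strings.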
From: Alex Mizrahi
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <2lfd4lFbjvcnU1@uni-berlin.de>
(message (Hello 'Paul)
(you :wrote  :on '(11 Jul 2004 20:13:42 -0700))
(

 >> For example, there are approaches which can read in the document
 >> incrementally (and I'm not just talking SAX here), rather than read
 >> the whole thing at once.

 PR> Rather than either reading incrementally or else slurping in the
 PR> entire document in many-noded glory, I wonder if anyone's
 PR> implemented a parser that scans over the XML doc and makes a compact
 PR> sequential representation of the tree structure, and then provides
 PR> access methods that let you traverse the tree as if it were a real
 PR> DOM, by fetching the appropriate strings from the (probably mmap'ed)
 PR> disk file as you walk around in the tree.

that would be nice.. i remember i did something like this for one binary
chunky format - thingie avoided allocating new memory as long as possible..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
From: Cameron Laird
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <5jpas1-81p.ln1@lairds.us>
In article <··············@ruckus.brouhaha.com>,
Paul Rubin  <·············@NOSPAM.invalid> wrote:
>Peter Hansen <·····@engcorp.com> writes:
>> For example, there are approaches which can read in the
>> document incrementally (and I'm not just talking SAX here),
>> rather than read the whole thing at once.
>
>Rather than either reading incrementally or else slurping in the
>entire document in many-noded glory, I wonder if anyone's implemented
>a parser that scans over the XML doc and makes a compact sequential
>representation of the tree structure, and then provides access methods
>that let you traverse the tree as if it were a real DOM, by fetching
>the appropriate strings from the (probably mmap'ed) disk file as you
>walk around in the tree.

While I don't yet follow all the places this thread has gone,
tDOM <URL: http://wiki.tcl.tk/tdom > is where I turn when I
want *fast* DOMish handling.  Although its author favors Tcl,
there's no particular reason not to use it with Python.
From: Alex Mizrahi
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <2lfcvoFbs9m0U1@uni-berlin.de>
(message (Hello 'Peter)
(you :wrote  :on '(Sun, 11 Jul 2004 22:15:50 -0400))
(

 >> i have 3mb long XML document with about 150000 lines (i think it has
 >> about 200000 elements there) which i want to parse to DOM to work
 >> with.

 PH> Often, problems with performance come down to using the wrong
 PH> algorithm, or using the wrong architecture for the problem at hand.

i see nothing wrong in loading 3 mb data into RAM. however, implementation
details made it 100 times larger and it was the problem..

 PH> Are you absolutely certain that using a full in-memory DOM
 PH> representation is the best for your problem?  It seems very unlikely
 PH> to me that it really is...

format i'm dealing with is quite chaotic and i'm going to work with it
interactively - track down myself where data i need lie and see how can i
extract data..
it's only a small part of task and it's needed only temporarily, so i don't
need best thing possible - i need something that just works..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
From: Peter Hansen
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <FeKdnQvhA-sbEm_dRVn-gQ@powergate.ca>
Alex Mizrahi wrote:

> (message (Hello 'Peter)
>  >> i have 3mb long XML document with about 150000 lines (i think it has
>  >> about 200000 elements there) which i want to parse to DOM to work
>  >> with.
> 
>  PH> Often, problems with performance come down to using the wrong
>  PH> algorithm, or using the wrong architecture for the problem at hand.
> 
> i see nothing wrong in loading 3 mb data into RAM. however, implementation
> details made it 100 times larger and it was the problem..

What is problematic about that for you?

>  PH> Are you absolutely certain that using a full in-memory DOM
>  PH> representation is the best for your problem?  It seems very unlikely
>  PH> to me that it really is...
> 
> format i'm dealing with is quite chaotic and i'm going to work with it
> interactively - track down myself where data i need lie and see how can i
> extract data..

You didn't mention this before.  If you're doing it interactively,
which I assume means with you actually typing lines of code that
will be executed in real-time, as you hit ENTER, then why the heck
are you concerned about the RAM footprint (i.e. there's nothing wrong
with loading 100MB of data into RAM for such a case either) or
even performance (since clearly you are going to be spending many
times more time working with the data than it takes to parse it,
even with some of the slower methods)?

Like I said, pick the right architecture for your problem domain...

-Peter
From: Ralph Richard Cook
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <40f2003f.44295613@newsgroups.bellsouth.net>
"Alex Mizrahi" <········@hotmail.com> wrote:

>Hello, All!
>
>i have 3mb long XML document with about 150000 lines (i think it has about
>200000 elements there) which i want to parse to DOM to work with.
...
>then i asdf-installed s-xml package and tried it with it. it ate only 25
>megs for lxml representation. i think interning element names helped a lot..
>it was CLISP that has unicode inside, so i think it could be even less
>without unicode..
...
>With best regards, Alex 'killer_storm' Mizrahi.
>
I too like s-xml, but I prefer to use the xml-struct-dom way, not the
lxml. 

Could you try your test with xml-struct-dom? Also, could you throw a
(time ) call around your parses for the two?
From: Alex Mizrahi
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <2lfensFatbqfU1@uni-berlin.de>
(message (Hello 'Ralph)
(you :wrote  :on '(Mon, 12 Jul 2004 03:10:25 GMT))
(

 RRC> I too like s-xml, but I prefer to use the xml-struct-dom way, not
 RRC> the lxml.

 RRC> Could you try your test with xml-struct-dom? Also, could you throw
 RRC> a (time ) call around your parses for the two?

time says 11 seconds for both (lxml could be 0.5 seconds faster - but that's
below measurement accuracy because this machine is doing some tasks in the
background).
size for lxml is about 8 megs and for xml-struct - about 15 megs.
(my previous measure was incorrect because i had the whole file read into a
string - and it was about 11 megs - and this time i subtracted the lisp image
size, which is about 6 megs)

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
From: Ivan Boldyrev
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <9oj9s1xf94.ln2@ibhome.cgitftp.uiggm.nsc.ru>
On 8803 day of my life Alex Mizrahi wrote:
> so lisp is winner.. but it has not standard way (even no non-standard but
> simple) way to write binary IEEE floating point representation, so common
> lisp suck and i will use c++ for my task.. 8-]]]

Here is (portable?) code for storing floating point number.  I have
implemented it for single-float, but if you ask very politely (may be
in Russian), I can try to implement it for double-float.  Or you can
do it yourself.  If so, please share result.

;;; ieee.lisp by Ivan Boldyrev
;;; You can freely use code below without any restriction.
;;; NO WARRANTY, and so on.

;(declaim (optimize (speed 3)))

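;;; Return the IEEE 754 single-precision bit pattern of X as an integer.
;;; Handles zero and normalized numbers; denormalized inputs are not handled.
(defun sf->int (x)
  (declare (type single-float x))
  (multiple-value-bind (m e s) (integer-decode-float x)
    (declare
     (type (unsigned-byte 24) m)
     (type (integer -1 1) s))
    (let ((sgn (if (plusp s) #X00000000 #X80000000)))
      (if (zerop m)
          sgn                           ; zero: exponent and mantissa bits are 0
          (let ((eb (+ 150 e)))         ; 150=127+23
            (declare (type unsigned-byte eb))
            (logior (mod m #X800000) (ash eb 23) sgn))))))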

(defun write-int (output n)
  (declare (type (unsigned-byte 32) n))
  (dotimes (i 4)
    (write-byte (mod n 256) output)
    (setf n (ash n -8)))) ; Look everybody, I use setf!!! :)

;;; IEEE 754 floating point converting
(defun int->sf (x)
  (declare (type (unsigned-byte 32) x))
  (let ((sign (if (logbitp 31 x) -1.0 +1.0)) 
	(mantissa (ldb (byte 23 0) x))
	(expt (ldb (byte 8 23) x)))
    (declare
     (type (unsigned-byte 23) mantissa)
     (type (unsigned-byte  8) expt))
    (case expt
      #|(255
       (cond
	 ((plusp mantissa)
	  +NaN+)                                ; Not-a-Number
	 ((plusp sign) +PlusINF+)               ; +INF
	 (t +MinusINF+)))|#                     ; -INF
      (0
       (if (zerop mantissa)
	   0.0
	   (* sign
	      (scale-float
	       (coerce mantissa 'single-float)
	       -149))))                         ; Denormalized numbers
      (otherwise
       (* sign
	  (scale-float
	   (coerce (logior mantissa #X800000) 'single-float)
	   (- expt 150)))))))

;;; test.dat content will be checked by C program
(defun test ()
  (with-open-file (bos "test.dat"
                       :direction :output
                       :if-exists :supersede
                       :element-type '(unsigned-byte 8))
    (dotimes (i 10000)
      (write-int
       bos
       (sf->int (* 0.001 (coerce (- i 5000) 'single-float)))))))

(defun test2 (x)
  (declare (type single-float x))
  (int->sf (sf->int x)))
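
For readers who want to cross-check the bit layout outside Lisp, here is a small Python sketch that rebuilds the same pattern by hand and compares it with the `struct` module. It is only an illustration of the format: `sf_to_int` and `to_little_endian` are made-up names mirroring `sf->int` and `write-int`, and the test values are chosen to be exactly representable in single precision.

```python
import math
import struct

def sf_to_int(x):
    """Bit pattern of zero or a normalized single-precision float."""
    if x == 0.0:
        return 0x80000000 if math.copysign(1.0, x) < 0 else 0
    frac, exp = math.frexp(abs(x))   # abs(x) == frac * 2**exp, 0.5 <= frac < 1
    m = int(frac * (1 << 24))        # 24-bit integer mantissa, hidden bit included
    e = exp - 24                     # so that abs(x) == m * 2**e
    sign = 0x80000000 if x < 0 else 0
    # Drop the hidden bit, bias the exponent by 150 = 127 + 23, add the sign.
    return (m % 0x800000) | ((e + 150) << 23) | sign

def to_little_endian(n):
    """Four bytes, least significant first, in the order write-int emits them."""
    return bytes((n >> (8 * i)) & 0xFF for i in range(4))

for value in (1.0, -2.5, 0.15625, 0.0):
    assert to_little_endian(sf_to_int(value)) == struct.pack("<f", value)
print(hex(sf_to_int(1.0)))
```

The same `struct.unpack("<f", ...)` call, applied four bytes at a time, would be one way to read test.dat back and compare it with the expected values.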


-- 
Ivan Boldyrev

                                                  Your bytes are bitten.
From: Marco Antoniotti
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <liSIc.33$2i5.35680@typhoon.nyu.edu>
Does this do anything different from the PARSE-FLOAT library in the 
AI.Repository?

Cheers

Marco




Ivan Boldyrev wrote:
> On 8803 day of my life Alex Mizrahi wrote:
> 
>>so lisp is winner.. but it has not standard way (even no non-standard but
>>simple) way to write binary IEEE floating point representation, so common
>>lisp suck and i will use c++ for my task.. 8-]]]
> 
> 
> Here is (portable?) code for storing floating point number.  I have
> implemented it for single-float, but if you ask very politely (may be
> in Russian), I can try to implement it for double-float.  Or you can
> do it yourself.  If so, please share result.
> 
> [quoted code snipped]
> 
From: Ivan Boldyrev
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <0mods1xk9u.ln2@ibhome.cgitftp.uiggm.nsc.ru>

On 8804 day of my life Marco Antoniotti wrote:
> Does this do anything different from the PARSE-FLOAT library in the
> AI.Repository?

I don't know, but I wrote that code myself.  I will check
AI.Repository tomorrow at work.

>> ;;; ieee.lisp by Ivan Boldyrev
>> ;;; You can freely use code below without any restriction.
>> ;;; NO WARRANTY, and so on.


-- 
Ivan Boldyrev

                                                  Is 'morning' a gerund?

From: Ivan Boldyrev
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <1e5gs1xg2h.ln2@ibhome.cgitftp.uiggm.nsc.ru>

On 8804 day of my life Marco Antoniotti wrote:
> Does this do anything different from the PARSE-FLOAT library in the
> AI.Repository?

PARSE-FLOAT converts string into float.  INT->SF converts IEEE
representation of float into float.

>> ;;; ieee.lisp by Ivan Boldyrev
>> ;;; You can freely use code below without any restriction.
>> ;;; NO WARRANTY, and so on.

-- 
Ivan Boldyrev

Violets are red, Roses are blue. //
I'm schizophrenic, And so am I.

From: Frank Buss
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <ccu28p$59e$1@newsreader2.netcologne.de>
"Alex Mizrahi" <········@hotmail.com> wrote:

> so lisp is winner.. but it has not standard way (even no non-standard
> but simple) way to write binary IEEE floating point representation, so
> common lisp suck and i will use c++ for my task.. 8-]]]

This is a silly reason. You can integrate a C program for this, if you 
like, but Common Lisp has the function INTEGER-DECODE-FLOAT, which you can 
use for this, see for example this listing:

http://www.ai.mit.edu/people/cvince/date/src/code/java/serialization.lisp

But you need not to use this little complicated function, because CLISP has 
the non-standard function EXT:WRITE-FLOAT, which "writes a floating-point 
number in IEEE 754 binary representation":

http://clisp.cons.org/impnotes/stream-dict.html#bin-output

I assume other Lisp implementations have something similar.

-- 
Frank Buß, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Alex Mizrahi
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <2lfpbcFcegjdU1@uni-berlin.de>
(message (Hello 'Frank)
(you :wrote  :on '(Mon, 12 Jul 2004 13:04:25 +0000 (UTC)))
(

 FB>  but Common Lisp has the function INTEGER-DECODE-FLOAT, which you can
 FB> use for this, see for example this listing:

i asked about half a year ago how to do it, but there were no helpful
responses at that time:
http://www.google.com.ua/groups?hl=ru&lr=&ie=UTF-8&selm=bqa8io%241vjv2c%241%40ID-177567.news.uni-berlin.de

 FB> http://www.ai.mit.edu/people/cvince/date/src/code/java/serialization.lisp

they are not sure if it works correctly
; This should implement IEEE 754 floating-point correctly!
; Close enough for now.
but if it works it's nice. i knew that this is possible to do it in such
way, however i didn't want to deal with fp number format myself 8-]

 FB> But you need not to use this little complicated function, because
 FB> CLISP has  the non-standard function EXT:WRITE-FLOAT, which "writes
 FB> a floating-point  number in IEEE 754 binary representation":

 FB> http://clisp.cons.org/impnotes/stream-dict.html#bin-output

that's even better.. so i'll do all this processing in clisp, i think..

 FB> I assume other Lisp implementations have something similar.

i searched for such functions for a long time in LispWorks documentation but
found nothing of that kind..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))
From: Alan Crowe
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <86eknhvyyl.fsf@cawtech.freeserve.co.uk>
Frank Buss wrote:

> But you need not to use this little complicated function,
> because CLISP has the non-standard function
> EXT:WRITE-FLOAT, which "writes a floating-point number in
> IEEE 754 binary representation":

> http://clisp.cons.org/impnotes/stream-dict.html#bin-output

> I assume other Lisp implementations have something similar.

In CMUCL

(format nil "~X" (kernel:single-float-bits 1.0))
=> "3F800000"

There is also KERNEL:DOUBLE-FLOAT-HIGH-BITS and
KERNEL:DOUBLE-FLOAT-LOW-BITS
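
As an illustration of what the two halves of a double look like, here is a hedged Python sketch using `struct`. The helper name `double_float_bits` is made up, and both halves are treated as unsigned 32-bit values here, which may not match how CMUCL reports the high word for negative numbers.

```python
import struct

def double_float_bits(x):
    """Return (high, low) unsigned 32-bit halves of the IEEE 754 double x."""
    # Pack as a big-endian double, then reread the 8 bytes as two 32-bit words.
    high, low = struct.unpack(">II", struct.pack(">d", x))
    return high, low

high, low = double_float_bits(1.0)
print(f"{high:08X} {low:08X}")
```

Writing the low word and then the high word, each least significant byte first, gives the little-endian byte order that a C program on x86 would expect.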

--
Alan Crowe
Edinburgh
Scotland
From: R. Mattes
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <pan.2004.07.12.14.43.04.480494.4229@mh-freiburg.de>
On Mon, 12 Jul 2004 02:19:03 +0200, Alex Mizrahi wrote:

> Hello, All!
> 
> i have 3mb long XML document with about 150000 lines (i think it has
> about 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were.. first i
> tried Python.. there's special interest group that wants to "make Python
> become the premier language for XML processing" so i thought there will
> be no problems with this document.. i used xml.dom.minidom to parse it..
> after it ate 400 meg of RAM i killed it - i don't want such processing..
> i think this is because of that fat class implementation - possibly
> Python had some significant overhead for each object instance, or
> something like this..

First of all: which parser did you actually use? There are quite a number
of XML parsers for python. I personally use the libxml2 one and never had
memory problems like you describe.

> then i asdf-installed s-xml package and tried it with it. it ate only 25
> megs for lxml representation. i think interning element names helped a
> lot.. it was CLISP that has unicode inside, so i think it could be even
> less without unicode..

Hmmm. Hmmm ... i guess you know that you compare apples with pears? 
S-XML is a nice, small parser but nowhere near a standard conformant 
XML parser. Have a closer look at the webpage: no handling of character
encoding, no CDATA, can't handle what the author calls "special tags" 
(like processing instruction), no schema/DTD support, and, most
important, no namespace support!

> then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks
> like interning helps a lot 8-]

Interning is _much_ easier without namespaces.
 
> then i tried Perl XML::DOM.. it was better than python - about 180megs,
> but it was slowest.. at least it consumed mem slower than python 8-]
> 
> and java.. with default parser it took 45mbs.. maybe it interned
> strings, but there was overhead from classes - storing trees is
> definitely what's lisp optimized for 8-]

But you never got to a _full_ DOM with your lxml parsing. What you got was
a list-of-lists. There's no 'parent' implementation for your lxml
elements (which means that you might need to pass the whole thing
around all the time).
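
To make the missing-parent point concrete, here is a small Python sketch with nested lists standing in for the list-of-lists representation (an assumed `[tag, child, ...]` shape, not s-xml's exact format). Without back pointers, parent lookup needs a side table built by one extra walk over the tree; `build_parent_map` is an invented helper.

```python
def build_parent_map(tree):
    """Map id(child) -> parent for a nested-list tree [tag, child, ...]."""
    parents = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node[1:]:
            if isinstance(child, list):   # strings are text, not elements
                parents[id(child)] = node
                stack.append(child)
    return parents

leaf = ["td", "cell text"]
row = ["tr", leaf]
table = ["table", row]

parents = build_parent_map(table)
print(parents[id(leaf)][0], parents[id(row)][0])
```

The table costs one dictionary entry per element, which is part of the memory a full DOM pays for up front.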

If you want a serious comparison you either need to compare s-xml with
similar "lightweight" parsers in Perl/Python/Ruby etc. or write your own
fully DOM compliant parser in LISP (or is there one already? I'm still
looking for a good one).

 Just my 0.02 $

   Ralf Mattes


> so lisp is winner.. but it has not standard way (even no non-standard
> but simple) way to write binary IEEE floating point representation, so
> common lisp suck and i will use c++ for my task.. 8-]]]
> 
> With best regards, Alex 'killer_storm' Mizrahi.
From: Peter Hansen
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <FeKdnQrhA-uYDW_dRVn-gQ@powergate.ca>
R. Mattes wrote:

> But you never got to a _full_ DOM with your lxml parsing. What you got was
> a list-of-lists. There's no 'parent' implementation for your lxml
> elements (which means that you might need to pass the whole thing
> around all the time).
> 
> If you want a serious comparison you either need to compare s-xml with
> similar "lightweight" parsers in Perl/Python/Ruby etc. 

If that's what he wants, it would be called PyRXP in the Python world,
a Python wrapper around the RXP library, available from 
http://www.reportlab.com/ .  Has a much smaller footprint than the
DOM representations, as you would expect, and lightning fast.

-Peter
From: Cameron Laird
Subject: Lisp certainly can handle numerics (was: lisp is winner in DOM parsing contest! 8-])
Date: 
Message-ID: <82qas1-81p.ln1@lairds.us>
In article <··············@uni-berlin.de>,
Alex Mizrahi <········@hotmail.com> wrote:
			.
			.
			.
>so lisp is winner.. but it has not standard way (even no non-standard but
>simple) way to write binary IEEE floating point representation, so common
>lisp suck and i will use c++ for my task.. 8-]]]
			.
			.
			.
I'm having trouble following this thread.  In isolation, though,
this claim *can't* be true.  Is there a particular Lisp implemen-
tation you have in mind?
From: Uche Ogbuji
Subject: Re: lisp is winner in DOM parsing contest! 8-]
Date: 
Message-ID: <d116fbae.0407160840.5746e98d@posting.google.com>
"Alex Mizrahi" <········@hotmail.com> wrote in message news:<··············@uni-berlin.de>...
> Hello, All!
> 
> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were..
> first i tried Python.. there's special interest group that wants to "make
> Python become the premier language for XML processing" so i thought there
> will be no problems with this document..
> i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
> it - i don't want such processing.. i think this is because of that fat
> class implementation - possibly Python had some significant overhead for each
> object instance, or something like this..

minidom has about a 60X - 80X load factor on average (comparing XML
file size to memory working set).  You're claiming you saw a 130X load
factor.  That sounds odd.  Are there special characteristics of your
document you're not mentioning?
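
A load factor of this kind can be estimated roughly along the following lines. This is a sketch using `tracemalloc` from current Python, which counts Python-level allocations rather than the process working set, so the number is only indicative; `minidom_load_factor` and the synthetic document are made up for the example.

```python
import tracemalloc
import xml.dom.minidom

def minidom_load_factor(xml_text):
    """Peak traced allocation while parsing, divided by the document size."""
    tracemalloc.start()
    dom = xml.dom.minidom.parseString(xml_text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    dom.unlink()  # break reference cycles so the tree can be freed
    return peak / len(xml_text.encode("utf-8"))

# Synthetic stand-in for a real file: many small elements.
sample = "<root>" + "<item id='1'>x</item>" * 5000 + "</root>"
factor = minidom_load_factor(sample)
print(f"load factor: about {factor:.0f}x")
```

Documents made of many tiny elements, like this one, tend to show a higher factor than documents dominated by long text nodes.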

cDomlette, part of 4Suite, only has a 10X load factor, so I'd guess
your example would end up with a 30MB memory working set.  cDomlette
does use string interning, as one example of optimization techniques. 
4Suite also provides you XSLT, XPath, RELAX NG and some other
processing goodies.

See:

http://4suite.org/
http://www.xml.com/pub/a/2002/10/16/py-xml.html
http://uche.ogbuji.net/akara/nodes/2003-01-01/domlettes?xslt=/akara/akara.xslt

As you can see, cDomlette is as DOM-like as minidom, so very easy to
use (for the record neither is a compliant W3C DOM implementation).

Also, in my next XML.com article I cover a technique that uses SAX to
break large XML documents into series of small DOMs, one after the
other, so that the memory penalty is *very* low, depending on your
document structure.  It works with any DOM implementation that meets
the Python DOM binding, including minidom and cDomlette.
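
The general idea can be tried with the standard library's `xml.dom.pulldom`, which is built on SAX and expands one subtree at a time into a minidom fragment. This is a sketch of the idea only and not necessarily the technique from the article; `iter_small_doms` and the sample document are invented.

```python
import io
from xml.dom import pulldom

def iter_small_doms(source, record_tag):
    """Yield one small DOM subtree per `record_tag` element."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == record_tag:
            events.expandNode(node)   # pull in just this element's subtree
            yield node

doc = (b"<log><entry n='1'><msg>up</msg></entry>"
       b"<entry n='2'><msg>down</msg></entry></log>")
for entry in iter_small_doms(io.BytesIO(doc), "entry"):
    msg = entry.getElementsByTagName("msg")[0]
    print(entry.getAttribute("n"), msg.firstChild.data)
```

How low the memory penalty gets depends on how small the repeated subtrees are relative to the whole document.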

-- 
Uche
http://uche.ogbuji.net