Parsing hairy HTML - which tool/library

From: vedm
Subject: Parsing hairy HTML - which tool/library
Date: Mon, 31 Jul 2006 15:12:36 +0000
Message-ID: <VamdnQoO1NR-gFPZnZ2dnUVZ_sSdnZ2d@giganews.com>

I need to parse ill-formed HTML. I want something like a SAX parser
which reports starting and ending of elements.

I tried pxmlutils, but it doesn't do exactly start/end element reporting,
instead it gives you the whole HTML element in the form of a list.
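
As an aside, a tree delivered in list form can be walked afterwards to
recover start/end events, although that is not streaming: the whole
document has to be parsed first. Here is a minimal sketch in Python (not
Lisp), assuming a hypothetical nested-list representation in which the
first item of each list is the tag name and the rest are children. The
function name and the tree shape are illustrative only and are not
pxmlutils' actual output format.

```python
def walk_events(node):
    """Yield SAX-like events from a nested-list tree (hypothetical shape).

    A node is either a string (text) or a list whose first item is the
    tag name and whose remaining items are child nodes.
    """
    if isinstance(node, str):
        yield ("text", node)
        return
    tag, *children = node
    yield ("start", tag)
    for child in children:
        # Recurse so nested elements report their own start/end pairs.
        yield from walk_events(child)
    yield ("end", tag)


if __name__ == "__main__":
    tree = ["p", "hello ", ["b", "world"]]
    for event in walk_events(tree):
        print(event)
```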

I tried CXML too - it works great with well-formed XML but chokes on
ill-formed HTML.

Any suggestions?

-- 
vedm

From: Marijn
Subject: Re: Parsing hairy HTML - which tool/library
Date: Mon, 31 Jul 2006 15:49:36 +0000
Message-ID: <1154360976.649494.89760@p79g2000cwp.googlegroups.com>

> I need to parse ill-formed HTML. I want something like a SAX parser
> which reports starting and ending of elements.
>
> I tried pxmlutils, but it doesn't do exactly start/end element reporting,
> instead it gives you the whole HTML element in the form of a list.

The solution we used was to externally call tidy (tidy.sourceforge.net)
to transform the messy HTML into an XML file. Tidy does a good job of
guessing what the HTML should have looked like, and you can parse the
resulting XML however you want.
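
The tidy-then-parse approach could look roughly like the sketch below,
written in Python rather than Lisp purely for illustration. It assumes
a `tidy` executable is on the PATH and that the `-asxml`, `-quiet` and
`-numeric` options behave as in the HTML Tidy command-line
documentation. The helper names (`tidy_command`, `tidy_to_xhtml`,
`element_events`) are made up for this example, and the exit-status
handling is an assumption to check against your tidy version.

```python
import subprocess
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Record SAX start/end element events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def tidy_command():
    # Assumed options: -asxml (emit well-formed XHTML), -quiet (suppress
    # non-essential output), -numeric (numeric rather than named entities).
    return ["tidy", "-asxml", "-quiet", "-numeric"]


def tidy_to_xhtml(html_bytes):
    """Pipe messy HTML through an external tidy process, return XHTML bytes."""
    result = subprocess.run(tidy_command(), input=html_bytes, capture_output=True)
    # Assumption: tidy exits with 1 for warnings only and 2 for errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout


def element_events(xml_bytes):
    """Parse well-formed XML and return its start/end element events."""
    handler = EventCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


if __name__ == "__main__":
    messy = b"<p>unclosed <b>tags"
    for event in element_events(tidy_to_xhtml(messy)):
        print(event)
```

Any SAX parser would do for the second step; the point is only that
tidy's output is well-formed, so a strict XML parser no longer chokes.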

With a bit more work I guess it wouldn't be hard to slap a Lisp
interface onto tidy.


Marijn Haverbeke

From: sross
Subject: Re: Parsing hairy HTML - which tool/library
Date: Mon, 31 Jul 2006 20:01:46 +0000
Message-ID: <1154376106.544756.282790@b28g2000cwb.googlegroups.com>

vedm wrote:
> I need to parse ill-formed HTML. I want something like a SAX parser
> which reports starting and ending of elements.
>
> I tried pxmlutils, but it doesn't do exactly start/end element reporting,
> instead it gives you the whole HTML element in the form of a list.
>
> I tried CXML too - it works great with well-formed XML but chokes on
> ill-formed HTML.
>
> Any suggestions?

Not too sure if it is precisely what you are after, but there is
a cl-html-parse project on cliki.net

Sean.

From: vedm
Subject: Re: Parsing hairy HTML - which tool/library
Date: Mon, 31 Jul 2006 21:08:20 +0000
Message-ID: <_KudnSoyk5ze7FPZnZ2dnUVZ_r-dnZ2d@giganews.com>

"sross" <······@gmail.com> writes:

> vedm wrote:
>> I need to parse ill-formed HTML. I want something like a SAX parser
>> which reports starting and ending of elements.
>>
>> I tried pxmlutils, but it doesn't do exactly start/end element reporting,
>> instead it gives you the whole HTML element in the form of a list.
>>
>> I tried CXML too - it works great with well-formed XML but chokes on
>> ill-formed HTML.
>>
>> Any suggestions?
>
> Not too sure if it is precisely what you are after, but there is
> a cl-html-parse project on cliki.net

I know about it, but it is exactly like pxmlutils - a port of Franz's
xmlutils. So I've tried it already.

I think I will use either CyberNeko HTML Parser
(http://people.apache.org/~andyc/neko/doc/html/index.html) or
htmltidy. There is no similar Lisp library, at least I did not find any.

-- 
vedm