I need to parse ill-formed HTML. I want something like a SAX parser
that reports the start and end of each element.
I tried pxmlutils, but it doesn't report start/end element events;
instead it gives you the whole HTML element in the form of a list.
I tried CXML too. It works great with well-formed XML but chokes on
ill-formed HTML.
Any suggestions?
--
vedm
> I need to parse ill-formed HTML. I want something like a SAX parser
> that reports the start and end of each element.
>
> I tried pxmlutils, but it doesn't report start/end element events;
> instead it gives you the whole HTML element in the form of a list.
The solution we used was to externally call tidy (tidy.sourceforge.net)
to transform the messy HTML into an XML file. Tidy does a good job of
guessing what the HTML should have looked like, and you can parse the
resulting XML however you want.
With a bit more work, I guess it wouldn't be hard to slap a Lisp
interface onto tidy.
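As a rough sketch of that pipeline (in Python rather than Lisp, just to
show the shape of it): pipe the HTML through tidy, then hand the result
to an ordinary SAX parser that reports element starts and ends. The
names tidy_to_xml, EventCollector and element_events are made up for
this illustration, and the tidy flags assume an HTML Tidy executable on
your PATH; this is not the code we actually used.

```python
import subprocess
import xml.sax


def tidy_to_xml(html_text, tidy_cmd="tidy"):
    """Pipe HTML through an external tidy process and return XML bytes.

    Assumption: an HTML Tidy executable named `tidy_cmd` is on PATH.
    """
    result = subprocess.run(
        [tidy_cmd, "-q", "-utf8", "-asxml", "-numeric",
         "--doctype", "omit", "--show-warnings", "no"],
        input=html_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    # tidy exits with 1 when it only had warnings and 2 on errors.
    if result.returncode > 1:
        raise RuntimeError("tidy could not repair the input")
    return result.stdout


class EventCollector(xml.sax.ContentHandler):
    """SAX handler that records start/end element events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def element_events(xml_bytes):
    """Parse well-formed XML and return the list of element events."""
    handler = EventCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events
```

With tidy installed, element_events(tidy_to_xml(messy_html)) should give
the start/end stream for the repaired document. Omitting the doctype and
asking for numeric entities is meant to keep the XML parser from needing
the XHTML DTD.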
Marijn Haverbeke
vedm wrote:
> I need to parse ill-formed HTML. I want something like a SAX parser
> that reports the start and end of each element.
>
> I tried pxmlutils, but it doesn't report start/end element events;
> instead it gives you the whole HTML element in the form of a list.
>
> I tried CXML too. It works great with well-formed XML but chokes on
> ill-formed HTML.
>
> Any suggestions?
Not too sure if it is precisely what you are after, but there is
a cl-html-parse project on cliki.net.
Sean.
"sross" <······@gmail.com> writes:
> vedm wrote:
>> I need to parse ill-formed HTML. I want something like a SAX parser
>> that reports the start and end of each element.
>>
>> I tried pxmlutils, but it doesn't report start/end element events;
>> instead it gives you the whole HTML element in the form of a list.
>>
>> I tried CXML too. It works great with well-formed XML but chokes on
>> ill-formed HTML.
>>
>> Any suggestions?
>
> Not too sure if it is precisely what you are after, but there is
> a cl-html-parse project on cliki.net.
I know about it, but it is exactly like pxmlutils - a port of Franz's
xmlutils. So I've tried it already.
I think I will use either the CyberNeko HTML Parser
(http://people.apache.org/~andyc/neko/doc/html/index.html) or
htmltidy. As far as I can tell, there is no similar Lisp library.
--
vedm