HTML/DOM parser

From: Xah Lee
Subject: HTML/DOM parser
Date: Tue, 28 Feb 2006 08:34:42 +0000
Message-ID: <1141115682.916858.318680@j33g2000cwa.googlegroups.com>

is there a library in lisp/scheme that lets me parse validated html
files and store it as a tree?

for example, i want to be able to easily, say, replace the following

<hr><p>References</p>
<pre>
• <a href=a>a...</a>
...
</pre>

to

<hr><p>References</p>
<ul>
<li><a href=a>a...</a></li>
...
</ul>

Thanks.

   Xah
   ···@xahlee.org
 ∑ http://xahlee.org/

From: Pascal Costanza
Subject: Re: HTML/DOM parser
Date: Tue, 28 Feb 2006 08:41:02 +0000
Message-ID: <46igkvFas14oU1@individual.net>

Xah Lee wrote:
> is there a library in lisp/scheme that lets me parse validated html
> files and store it as a tree?

Common Lisp: http://www.cl-user.net/asp/search?search=XML&b=Search
Scheme: http://okmij.org/ftp/Scheme/xml.html

Pascal

-- 
My website: http://p-cos.net
Closer to MOP & ContextL:
http://common-lisp.net/project/closer/

From: ·······@gmail.com
Subject: Re: HTML/DOM parser
Date: Wed, 01 Mar 2006 04:58:59 +0000
Message-ID: <1141189139.937915.157570@i40g2000cwc.googlegroups.com>

But valid HTML may not be valid XML (i.e. if not valid XHTML).  And
HTML in the wild may not even be valid HTML.  The example in the little
excerpt given is the <hr> tag.  This is not closed, and would cause an
XML parser to choke.

I've wondered if there was a "friendly" parser in Lisp that will take
say in-the-wild HTML "tag soup" and produce a semi-faithful DOM tree
from it (or just a proper XHTML version) that can then be addressed
with XPath and such.  An example in Python is Beautiful Soup
(http://www.crummy.com/software/BeautifulSoup/).

But my time to actually use it for what I want (which may well be
easier done in Python anyway) is so little that I haven't bothered to
investigate or ask.  But the topic is sort of close here, so putting
the question out can't hurt too much.
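
To make the point about the unclosed <hr> concrete, here is a minimal sketch in Python (not Lisp), using only the standard library. It assumes a simple nested-list tree shape of [tag, attrs, children]; the names TreeBuilder, parse_html, xml_accepts and the VOID set are hypothetical and chosen for illustration. It is not a full tag-soup parser like Beautiful Soup: it only skips pushing void elements and ignores stray end tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that HTML defines as having no closing tag (partial list,
# an assumption made for this sketch).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Build a nested [tag, attrs, children] tree from lenient HTML."""

    def __init__(self):
        super().__init__()
        self.root = ["#root", {}, []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs), []]
        self.stack[-1][2].append(node)
        # Void elements such as <hr> never become the current parent.
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep non-blank text as plain string children.
        if data.strip():
            self.stack[-1][2].append(data.strip())


def parse_html(text):
    """Return the root node of the tree built from text."""
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def xml_accepts(text):
    """Report whether a strict XML parser accepts the fragment."""
    try:
        ET.fromstring("<root>" + text + "</root>")
        return True
    except ET.ParseError:
        return False


SNIPPET = "<hr><p>References</p>"
tree = parse_html(SNIPPET)
```

With the excerpt from the original post, xml_accepts(SNIPPET) reports a failure because <hr> is never closed, while parse_html(SNIPPET) yields a root with two children, hr and p.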

From: Pascal Bourguignon
Subject: Re: HTML/DOM parser
Date: Wed, 01 Mar 2006 05:22:27 +0000
Message-ID: <87y7zu20z0.fsf@thalassa.informatimago.com>

·······@gmail.com writes:

> But valid HTML may not be valid XML (i.e. if not valid XHTML).  And
> HTML in the wild may not even be valid HTML.  The example in the little
> excerpt given is the <hr> tag.  This is not closed, and would cause an
> XML parser to choke.

Most HTML in the wild is totally fucked up.
I'm not even sure one set of heuristics can parse usefully _all_ of them.

IMO, the best you can do, is consider the HTML you have to parse, and
write a specific parser.  Of course, you can start from a generic
"standard" parser, and adapt it to parse the HTML bugs you have, or
correct the HTML bugs in the HTML source, if you can.

(Don't even hope to be able to identify all the tags algorithmically!
There are often single occurrences of double-quote inside a tag, not to
speak of less benign horrors).

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
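
As an illustration of the "write a specific parser" advice applied to the original question, the following is a rough sketch in Python (not Lisp). It assumes the input always has the regular shape shown in the first post: a <pre> block, with the tags on their own lines, in which every non-blank line starts with a bullet character. The function name pre_bullets_to_list is hypothetical. It is a pattern rewrite for that one known shape, not a general HTML parser, and it leaves any <pre> block that does not fit the pattern untouched.

```python
import re

# A <pre> block with its opening and closing tags on their own lines.
PRE_BLOCK = re.compile(r"<pre>\n(.*?)\n</pre>", re.DOTALL)
# A line that starts with the bullet character U+2022.
BULLET = re.compile(r"^\s*\u2022\s*(.*)$")


def pre_bullets_to_list(html):
    """Rewrite <pre> blocks made only of bullet lines as <ul> lists."""

    def convert(match):
        lines = [ln for ln in match.group(1).split("\n") if ln.strip()]
        items = [BULLET.match(ln) for ln in lines]
        if not lines or not all(items):
            return match.group(0)  # not a bullet list; leave untouched
        body = "\n".join("<li>%s</li>" % m.group(1).strip() for m in items)
        return "<ul>\n" + body + "\n</ul>"

    return PRE_BLOCK.sub(convert, html)
```

The check that every line is a bullet line is the important design choice here: a <pre> block holding code or other preformatted text passes through unchanged.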

From: Marco Baringer
Subject: Re: HTML/DOM parser
Date: Wed, 01 Mar 2006 10:33:17 +0000
Message-ID: <m2mzgav4ia.fsf@bese.it>

·······@gmail.com writes:

> But valid HTML may not be valid XML (i.e. if not valid XHTML).  And
> HTML in the wild may not even be valid HTML.  The example in the little
> excerpt given is the <hr> tag.  This is not closed, and would cause an
> XML parser to choke.
>
> I've wondered if there was a "friendly" parser in Lisp that will take
> say in-the-wild HTML "tag soup" and produce a semi-faithful DOM tree
> from it (or just a proper XHTML version) that can then be addressed
> with XPath and such.  An example in Python is Beautiful Soup
> (http://www.crummy.com/software/BeautifulSoup/).

ftp://ftp.common-lisp.net/pub/project/bese/pxmlutils_1.0.tar.gz

contains a port of franz's excellent xmlutils package which contains,
other than the xml parser, a pretty good anything-goes html parser
(see the phtml.cl file). i've used it successfully to scrape all kinds
of web pages.

-- 
-Marco
Ring the bells that still can ring.
Forget the perfect offering.
There is a crack in everything.
That's how the light gets in.
	-Leonard Cohen

From: Jens Axel Søgaard
Subject: Re: HTML/DOM parser
Date: Wed, 01 Mar 2006 11:34:12 +0000
Message-ID: <440586b4$0$38686$edfadb0f@dread12.news.tele.dk>

·······@gmail.com wrote:
> But valid HTML may not be valid XML (i.e. if not valid XHTML).  And
> HTML in the wild may not even be valid HTML.  The example in the little
> excerpt given is the <hr> tag.  This is not closed, and would cause an
> XML parser to choke.
> 
> I've wondered if there was a "friendly" parser in Lisp that will take
> say in-the-wild HTML "tag soup" and produce a semi-faithful DOM tree
> from it (or just a proper XHTML version) that can then be addressed
> with XPath and such.  An example in Python is Beautiful Soup
> (http://www.crummy.com/software/BeautifulSoup/).

Neil van Dyke has written HtmlPrag for Scheme.

     <http://www.neilvandyke.org/htmlprag/>

-- 
Jens Axel Søgaard