From: Christopher C. Stacy
Subject: HTML browsing/parsing
Date: Sun, 05 Dec 2004 01:10:58 GMT
Message-ID: <usm6lv9b1.fsf@news.dtpq.com>
What software are people using for client-side web applications?
I want to do some spidering and screen scraping, and am hoping
not to have to write everything from scratch.
From: Edi Weitz
Subject: Re: HTML browsing/parsing
Date: 
Message-ID: <uy8gd33zd.fsf@agharta.de>
On Sun, 05 Dec 2004 01:10:58 GMT, ······@news.dtpq.com (Christopher C. Stacy) wrote:

> What software are people using for client-side web applications?  I
> want to do some spidering and screen scraping, and am hoping not to
> have to write everything from scratch.

(Portable) AllegroServe contains client functions[1] to retrieve web
pages with support for https, authorization, cookies, redirects,
etc. If you only need a subset of these facilities, it's of course
quite easy to do it yourself; email me privately if you want me to
send you my client code for LispWorks/CMUCL.
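
For example, fetching a page with DO-HTTP-REQUEST looks roughly like
this (untested sketch from memory; the function lives in the
NET.ASERVE.CLIENT package, if I remember correctly):

  (multiple-value-bind (body code headers uri)
      (net.aserve.client:do-http-request "http://www.lisp.org/")
    (declare (ignore headers uri))
    ;; BODY is the page as a string, CODE is the HTTP status code.
    (if (eql code 200)
        body
        (error "Request failed with status ~A" code)))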

For HTML parsing you can use Franz' open-source HTML parser[2], or
you might want to look at URL-REWRITE[3], which is quite small and
also contains a simple HTML parser (it is integrated into the rest of
the application but should be easy to pull out).
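
The Franz parser returns the page as LHTML, i.e. a nested list where
tags become keywords and attributes sit in the tag position, so
extracting data is plain list traversal. A rough, untested sketch
(PARSE-HTML should be in the NET.HTML.PARSER package):

  (net.html.parser:parse-html "<p>Hello <b>world</b>!</p>")
  ;; => ((:P "Hello " (:B "world") "!"))   (more or less)

  (defun collect-hrefs (nodes &aux (hrefs '()))
    "Walk a list of LHTML nodes and collect the HREF values of A tags."
    (labels ((walk (node)
               (when (consp node)
                 (let ((tag (first node)))
                   ;; A tag with attributes looks like ((:A :HREF "...") ...).
                   (when (and (consp tag) (eq (first tag) :a))
                     (let ((href (getf (rest tag) :href)))
                       (when href (push href hrefs))))
                   (dolist (child (rest node))
                     (walk child))))))
      (mapc #'walk nodes)
      (nreverse hrefs)))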

I've also used regular expressions to parse information out of
specific websites. This is of course not as general as full HTML
parsing but can be quite efficient. It helps to have Perl regex
syntax[4] (look-ahead, look-behind, etc.) for this task.
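
For instance, something along these lines will pull the title and the
link targets out of a page (untested, and it will of course break on
sufficiently weird markup; HTML below is the page as a string, e.g.
the BODY returned above):

  ;; (?i) makes the match case-insensitive, (?s) lets . match newlines.
  (cl-ppcre:register-groups-bind (title)
      ("(?is)<title[^>]*>(.*?)</title>" html)
    title)

  ;; Look-behind/look-ahead keep the attribute name and the quotes out
  ;; of the matches themselves.
  (cl-ppcre:all-matches-as-strings "(?i)(?<=href=\")[^\"]+(?=\")" html)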

OK, enough advertisement for today... :)

Edi.

[1] <http://opensource.franz.com/aserve/aserve-dist/doc/aserve.html#client-request>
[2] <http://opensource.franz.com/xmlutils/xmlutils-dist/phtml.htm>
[3] <http://weitz.de/url-rewrite/>
[4] <http://weitz.de/cl-ppcre/>

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")