Re: parsing HTML using LISP

From: Richard Tietjen
Subject: Re: parsing HTML using LISP
Date: Tue, 30 Sep 1997 00:00:00 +0000
Message-ID: <87vhzi19sp.fsf@kale.connix.com>

SDS <···········@cctrading.com> writes:

Joachim Schrod wrote a package called STIL for CLISP, using CLOS.  It
reads the stream of events from nsgmls, James Clark's SGML parser, so
you need Clark's sp distribution, which is easy to find, and the
appropriate HTML DTD.  I believe I got it from
ftp.th-darmstadt.de/pub/.
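
To give a rough idea of what a consumer of nsgmls output has to deal
with: nsgmls writes one event per line, and the first character of the
line names the event ("(" for a start tag, ")" for an end tag, "-" for
character data, "A" for an attribute of the following start tag).  The
sketch below is not STIL's code.  PARSE-ESIS-LINE, READ-ESIS-EVENTS and
the keyword names are made up for illustration, and the backslash
escapes that nsgmls uses inside data are left undecoded.

```lisp
;; Illustrative only: these function names are hypothetical and are
;; not part of STIL or of the sp distribution.
(defun parse-esis-line (line)
  "Turn one line of nsgmls output into a list (KIND TEXT), or NIL if empty."
  (when (plusp (length line))
    (let ((text (subseq line 1)))
      (case (char line 0)
        (#\( (list :start-element text))
        (#\) (list :end-element text))
        (#\- (list :data text))
        (#\A (list :attribute text))
        (t   (list :other line))))))

(defun read-esis-events (stream)
  "Collect the parsed events for every non-empty line on STREAM."
  (loop for line = (read-line stream nil nil)
        while line
        when (parse-esis-line line) collect it))
```

Called on a stream holding the nsgmls output, READ-ESIS-EVENTS returns
a flat list of events that a CLOS layer could then dispatch on.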

> 
> I am pretty sure that someone has implemented HTML parsing using LISP
> already, so I would appreciate some pointers/hints.
> 
> More precisely, I can envision 3 levels to the solution of the problem:
> 
> 1. Discard the markup completely, returning the stream of objects that
> are not within the markup. E.g., successive reads on the stream reading
> from the string 
> 	"<head><t1>title</t1></head><a href=\"http://www.net\">link</a>"
> produce 2 symbols (better yet, strings, but symbols are okay): 'title
> and 'link. I think this can be accomplished using readtables, but the
> following doesn't seem to work (why?):
> 
> (set-macro-character #\< #'read-html-markup)
> (set-macro-character #\> (get-macro-character #\) nil))
>   
> (defun read-html-markup (stream char)
>   "Skip through the HTML markup."
>   (declare (ignore char))
>   (do () ((char= (read-char stream t nil t) #\>))))
> 
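
Two likely reasons the readtable attempt misbehaves.  First,
#'read-html-markup is evaluated before the DEFUN, so loading the forms
in that order fails unless the function already exists.  Second, a
reader macro function that returns one value hands that value to READ
as the object read; the DO loop returns NIL, so every tag reads as
NIL.  Ending the function with (values) makes the reader skip the tag
instead.  Even then the Lisp reader will choke on text that contains
characters such as ( ' ; or #, so for a quick type 1 solution it may
be simpler to bypass READ and copy characters by hand.  A minimal
sketch, assuming no ">" inside attribute values and ignoring comments
and entities (TEXT-OUTSIDE-MARKUP is a made-up name):

```lisp
;; Hypothetical helper: collect the text that lies outside <...>
;; markup, one string per run of text.
(defun text-outside-markup (stream)
  "Return a list of the non-empty runs of text outside <...> markup."
  (let ((runs '())
        (in-tag nil)
        (buffer (make-string-output-stream)))
    (flet ((flush ()
             ;; GET-OUTPUT-STREAM-STRING also empties BUFFER.
             (let ((text (get-output-stream-string buffer)))
               (when (plusp (length text))
                 (push text runs)))))
      (do ((ch (read-char stream nil nil) (read-char stream nil nil)))
          ((null ch) (flush) (nreverse runs))
        (cond (in-tag (when (char= ch #\>) (setq in-tag nil)))
              ((char= ch #\<) (flush) (setq in-tag t))
              (t (write-char ch buffer)))))))
```

For example:

```lisp
(with-input-from-string
    (s "<head><t1>title</t1></head><a href=\"http://www.net\">link</a>")
  (text-outside-markup s))
;; => ("title" "link")
```

Whitespace between tags comes back as runs of its own, so real pages
will need some trimming on top of this.
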
> 2. Read the tags only, i.e., dump <b> but keep <a href="">, returning
> the URL only.
> 
> 3. *Really* parse HTML, attaching appropriate properties to the tokens
> read (whatever a "bold" string might mean). Say, each token read from
> the stream (thus it is *outside* of any markup) has a plist listing its
> attributes, like boldness, href etc.
> 
> I am mostly interested in a quick-n-dirty solution of type 1.
> 
> Also, Eric asked some time ago about the correct way to represent a URL
> in LISP, and, as far as I understood, the general mood was that logical
> pathnames are not the appropriate vehicle. What is? Does one have to
> create a special class/struct for that? (AFAIK, a URL can contain a lot
> of information: "protocol://[user[:password]@]host[:port][/[path]]").
> Are there routines available to parse such a monster?
> I assume that I would have to write one myself, so...
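
A structure is probably the simplest vehicle.  Below is a minimal
sketch, assuming the RFC 1738 style layout
protocol://user:password@host:port/path, with a colon before the
password.  The URL structure and PARSE-URL are illustrative names, not
an existing library, and nothing here validates the input or decodes
%-escapes.

```lisp
;; Illustrative sketch; the URL structure and PARSE-URL are
;; hypothetical names.  No validation or %-decoding is attempted.
(defstruct url protocol user password host port path)

(defun parse-url (string)
  "Split STRING of the form protocol://[user[:password]@]host[:port][/path]
into a URL structure.  Parts that are absent are left as NIL."
  (let* ((scheme-end (search "://" string))
         (protocol (and scheme-end (subseq string 0 scheme-end)))
         (remainder (if scheme-end
                        (subseq string (+ scheme-end 3))
                        string))
         ;; Everything up to the first slash is user, host and port.
         (slash (position #\/ remainder))
         (authority (subseq remainder 0 slash))
         (path (and slash (subseq remainder slash)))
         (at (position #\@ authority :from-end t))
         (userinfo (and at (subseq authority 0 at)))
         (hostport (if at (subseq authority (1+ at)) authority))
         (user-colon (and userinfo (position #\: userinfo)))
         (port-colon (position #\: hostport)))
    (make-url
     :protocol protocol
     :user (and userinfo (subseq userinfo 0 user-colon))
     :password (and user-colon (subseq userinfo (1+ user-colon)))
     :host (subseq hostport 0 port-colon)
     ;; A missing or non-numeric port yields NIL.
     :port (and port-colon
                (parse-integer hostport :start (1+ port-colon)
                                        :junk-allowed t))
     :path path)))
```

For example, (url-host (parse-url "http://www.net")) returns "www.net",
and the port and path slots of that structure are NIL.
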
> How do I read from a stream into a string until a non-URL-constituent
> character is encountered?
> (with-output-to-string (st)
>    (do (zz) ((non-url-constituent (setq zz (read-char whatever))) st)
> 	(princ zz st)))
> ?? (how do I check for a char being url-constituent?)
> 
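
For reading up to the first non-constituent character, PEEK-CHAR lets
you look at the next character without consuming it, so the terminator
stays on the stream.  A sketch follows.  URL-CONSTITUENT-P and its
character set are an assumption (alphanumerics plus some common URL
punctuation), so adjust the set to taste.

```lisp
;; Assumption: this character set is a guess at "URL constituent",
;; not taken from any standard verbatim.
(defun url-constituent-p (ch)
  "True if CH may appear in a URL under the assumption stated above."
  (or (alphanumericp ch)
      (find ch "-_.~:/?#@!$&*+,;=%")))

(defun read-url-string (stream)
  "Read characters from STREAM up to, but not including, the first
non-URL-constituent character or end of file; return them as a string."
  (with-output-to-string (out)
    (loop for ch = (peek-char nil stream nil nil)
          while (and ch (url-constituent-p ch))
          do (write-char (read-char stream) out))))
```

For example:

```lisp
(with-input-from-string (s "http://www.net/a?b=1 and more")
  (read-url-string s))
;; => "http://www.net/a?b=1"
```
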
> Thanks!
> 
> -- 
> Sam Steingold