In article <·············@WINTERMUTE.eagle> SDS <···········@cctrading.com> writes:
I am pretty sure that someone has implemented HTML parsing using LISP
already, so I would appreciate some pointers/hints.
More precisely, I can envision 3 levels to the solution of the problem:
1. Discard the markup completely, returning the stream of objects which
are not within the markup. E.g., successive reads on the stream reading
from the string
"<head><t1>title</t1></head><a href="http://www.net">link</a>"
produce 2 symbols (better yet, strings, but symbols are okay): 'title
and 'link. I think this can be accomplished using readtables, but the
following doesn't seem to work (why?):
(set-macro-character #\< #'read-html-markup)
(set-macro-character #\> (get-macro-character #\) nil))
(defun read-html-markup (stream char)
"Skip through the HTML markup."
(declare (ignore char))
(do () ((char= (read-char stream t nil t) #\>))))
I am mostly interested in a quick-n-dirty solution of type 1.
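For reference, a minimal sketch of the readtable approach. Two guesses about
the snippet above: #'READ-HTML-MARKUP is evaluated before the DEFUN exists,
and the DO loop returns NIL, so every skipped tag would read back as a NIL
object. The sketch defines the function first, returns zero values from it,
and installs it in a copy of the standard readtable:

(defun read-html-markup (stream char)
  "Consume characters up to and including the closing >."
  (declare (ignore char))
  (loop until (char= (read-char stream t nil t) #\>))
  ;; Return zero values so the skipped tag contributes no object.
  (values))

(defvar *html-readtable* (copy-readtable nil)
  "Readtable in which < skips over an HTML tag.")

(set-macro-character #\< #'read-html-markup nil *html-readtable*)
;; As in the snippet above: treat a stray > the way the reader treats a stray ).
(set-macro-character #\> (get-macro-character #\) nil) nil *html-readtable*)

;; (with-input-from-string (in "<head><t1>title</t1></head><a href=\"http://www.net\">link</a>")
;;   (let ((*readtable* *html-readtable*))
;;     (loop for object = (read in nil in)
;;           until (eq object in)
;;           collect object)))
;; => (TITLE LINK)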
Also, Eric asked some time ago about the correct way to represent a URL
in LISP, and, as far as I understood, the general mood was that logical
pathnames are not the appropriate vehicle. What is? Does one have to
create a special class/struct for that? (AFAIK, a URL can contain a lot
of information: "protocol://[user[#password]@]host[:port][/[path]]").
Are there routines available to parse such a monster?
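For illustration, a sketch of what such a structure might hold (slot names
and defaults here are invented for the example):

(defstruct url
  "Components of protocol://[user[#password]@]host[:port][/[path]]."
  (protocol "http")   ; scheme, e.g. "http" or "ftp"
  user                ; NIL when absent
  password            ; NIL when absent
  (host "")
  port                ; NIL means the protocol's default port
  path)               ; NIL or a string such as "/pub/cl-http"

;; (make-url :host "www.ai.mit.edu" :path "/pub/cl-http")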
I assume that I would have to write one myself, so...
How do I read from a stream into a string until a non-URL-constituent
character is encountered?
(with-output-to-string (st)
  (do (zz) ((non-url-constituent (setq zz (read-char whatever))) st)
    (princ zz st)))
?? (how do I check for a char being url-constituent?)
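One possible sketch, assuming the character set of RFC 1738 (alphanumerics
plus a handful of punctuation) is close enough, and using PEEK-CHAR so the
terminating character is left on the stream:

(defun url-constituent-p (char)
  "True if CHAR may appear in a URL (a deliberately loose test)."
  (or (alphanumericp char)
      (find char "-_.~!*'();:@&=+$,/?#%")))

(defun read-url (stream)
  "Collect characters from STREAM into a string, stopping (without
consuming) at the first non-URL-constituent character or at end of file."
  (with-output-to-string (out)
    (loop for char = (peek-char nil stream nil nil)
          while (and char (url-constituent-p char))
          do (write-char (read-char stream) out))))

;; (with-input-from-string (s "http://www.net/index.html and more")
;;   (read-url s))
;; => "http://www.net/index.html"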
Thanks!
--
Sam Steingold
I have implemented an HTML parser that is currently being distributed with
CL-HTTP (available at ftp.ai.mit.edu/pub/cl-http); the latest version ships
with the most recent release, in the devo subdirectory. It is a
full parser, but customizable to the point where you can tell it to ignore
anything but text, or specific tags, and so on. Be warned that the version
being distributed has a few bugs (which I have since fixed), but nothing
that should get in the way of extracting text.
More documentation is provided with the parser.
Sunil