In article <·············@WINTERMUTE.eagle> SDS <···········@cctrading.com> writes:
I am pretty sure that someone has implemented HTML parsing using LISP
already, so I would appreciate some pointers/hints.
More precisely, I can envision 3 levels to the solution of the problem:
1. Discard the markup completely, returning the stream of objects which
are not within the markup. E.g., successive reads on the stream reading
from the string
"<head><t1>title</t1></head><a href="http://www.net">link</a>"
produce 2 symbols (better yet, strings, but symbols are okay): 'title
and 'link. I think this can be accomplished using readtables, but the
following doesn't seem to work (why?):
(set-macro-character #\< #'read-html-markup)
(set-macro-character #\> (get-macro-character #\) nil))
(defun read-html-markup (stream char)
"Skip through the HTML markup."
(declare (ignore char))
(do () ((char= (read-char stream t nil t) #\>))))
I am mostly interested in a quick-n-dirty solution of type 1.
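For reference, a minimal sketch of the readtable approach. Two guesses about
the snippet above: #'READ-HTML-MARKUP is evaluated before the DEFUN exists,
and the DO loop returns NIL, so every skipped tag would read back as a NIL
object. The sketch defines the function first, returns zero values from it,
and installs it in a copy of the standard readtable:

(defun read-html-markup (stream char)
  "Consume characters up to and including the closing >."
  (declare (ignore char))
  (loop until (char= (read-char stream t nil t) #\>))
  ;; Return zero values so the skipped tag contributes no object.
  (values))

(defvar *html-readtable* (copy-readtable nil)
  "Readtable in which < skips over an HTML tag.")

(set-macro-character #\< #'read-html-markup nil *html-readtable*)
;; As in the snippet above: treat a stray > the way the reader treats a stray ).
(set-macro-character #\> (get-macro-character #\) nil) nil *html-readtable*)

;; (with-input-from-string (in "<head><t1>title</t1></head><a href=\"http://www.net\">link</a>")
;;   (let ((*readtable* *html-readtable*))
;;     (loop for object = (read in nil in)
;;           until (eq object in)
;;           collect object)))
;; => (TITLE LINK)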
Also, Eric asked some time ago about the correct way to represent a URL
in LISP, and, as far as I understood, the general mood was that logical
pathnames are not the appropriate vehicle. What is? Does one have to
create a special class/struct for that? (AFAIK, a URL can contain a lot
of information: "protocol://[user[#password]@]host[:port][/[path]]").
Are there routines available to parse such a monster?
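For illustration, a sketch of what such a structure might hold (slot names
and defaults here are invented for the example):

(defstruct url
  "Components of protocol://[user[#password]@]host[:port][/[path]]."
  (protocol "http")   ; scheme, e.g. "http" or "ftp"
  user                ; NIL when absent
  password            ; NIL when absent
  (host "")
  port                ; NIL means the protocol's default port
  path)               ; NIL or a string such as "/pub/cl-http"

;; (make-url :host "www.ai.mit.edu" :path "/pub/cl-http")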
I assume that I would have to write one myself, so...
How do I read from a stream into a string until a non-URL-constituent
character is encountered?
(with-output-to-string (st)
  (do (zz) ((non-url-constituent (setq zz (read-char whatever))) st)
    (princ zz st)))
?? (how do I check for a char being url-constituent?)
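One possible sketch, assuming the character set of RFC 1738 (alphanumerics
plus a handful of punctuation) is close enough, and using PEEK-CHAR so the
terminating character is left on the stream:

(defun url-constituent-p (char)
  "True if CHAR may appear in a URL (a deliberately loose test)."
  (or (alphanumericp char)
      (find char "-_.~!*'();:@&=+$,/?#%")))

(defun read-url (stream)
  "Collect characters from STREAM into a string, stopping (without
consuming) at the first non-URL-constituent character or at end of file."
  (with-output-to-string (out)
    (loop for char = (peek-char nil stream nil nil)
          while (and char (url-constituent-p char))
          do (write-char (read-char stream) out))))

;; (with-input-from-string (s "http://www.net/index.html and more")
;;   (read-url s))
;; => "http://www.net/index.html"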
Thanks!
--
Sam Steingold
I have implemented an HTML parser that is currently being distributed with
CL-HTTP (available at ftp.ai.mit.edu/pub/cl-http); the latest version ships
with the most recent release, in the devo subdirectory. It is a
full parser, but customizable to the point where you can tell it to ignore
anything but text, or specific tags, and so on. Be warned that the version
being distributed has a few bugs (which I have since fixed), but nothing
that should get in the way of extracting text.
More documentation is provided with the parser.
Sunil