From: ·······@Yahoo.Com
Subject: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <ffdf399b.0201261207.5ab6159e@posting.google.com>
Does anybody know of any public-domain or other freely
available Common Lisp source code for parsing HTML files into
a structured form? For example, <bold>text</bold> should
parse as something like (bold "text") instead of as
(start-bold) "text" (end-bold), and <a href="url">text</a>
should parse as something like (href "url" "text") instead
of as (start-href "url") "text" (end-href). Likewise, all
other semantically nested meaning should be represented as
nested s-expressions rather than as the sequences of start
and end forms that appear in the original HTML.

If such is not available online, how about just the surface
parse as a sequence of start-something and end-something
forms, which a subsequent Lisp program could then convert to
nested form?

Does anybody know of any public-domain or other freely
available Common Lisp source code for parsing e-mail headers,
for example Received lines correctly parsed into the from
part and the by part, with each part properly nested?

More generally, does there exist any archive of
freely available Common Lisp source code that is indexed by
topic (kind of task performed, kind of data processed) and
that has a Web server application acting as a search
engine, whereby somebody can type in a description such as I
gave in the first and third paragraphs above and the server
then automatically finds the most relevant matches?

If a fully indexed LISP-source archive with search engine
isn't available, then what about just a LISP-source archive
with an index by keywords or topics that is easily readable
by a LISP program so that I could write the search engine
myself?

From: Alexey Dejneka
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <m3pu3wxoz1.fsf@comail.ru>
·······@Yahoo.Com writes:

> Does anybody know of any public-domain or other freely
> available Common-LISP sourcecode for parsing HTML files

http://opensource.franz.com/xmlutils

> More generally, does there exist any archive of
> freely-available Common-LISP sourcecode which is indexed by
> topic

http://www.telent.net/cliki

-- 
Regards,
Alexey Dejneka
From: Pierre R. Mai
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <87zo2z3hhe.fsf@orion.bln.pmsf.de>
Alexey Dejneka <········@comail.ru> writes:

> ·······@Yahoo.Com writes:
> 
> > Does anybody know of any public-domain or other freely
> > available Common-LISP sourcecode for parsing HTML files
> 
> http://opensource.franz.com/xmlutils

Note, though, that although parsing HTML should be a simple application
of a generic SGML parser (not an XML parser; for that you need XHTML),
unless you generate the HTML files yourself, it isn't:  There is all
kinds of crap out there in the wild that purports to be HTML (i.e.
conforming to some HTML DTD) and, through heroic efforts on the part of
the browser writers, does render OK.  So unless you can afford to tell
99% of your users "Error: This is not HTML!", you'll need to handle
all sorts of erroneous stuff, like mismatched start and end tags,
etc.  Generally, parsing real-world HTML is probably more akin to
heuristic natural language parsing than it is to SGML parsing.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Thaddeus L Olczyk
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <3c553d37.1920671@nntp.interaccess.com>
On 27 Jan 2002 17:07:25 +0100, "Pierre R. Mai" <····@acm.org> wrote:

>Alexey Dejneka <········@comail.ru> writes:
>
>> ·······@Yahoo.Com writes:
>> 
>> > Does anybody know of any public-domain or other freely
>> > available Common-LISP sourcecode for parsing HTML files
>> 
>> http://opensource.franz.com/xmlutils
>
>Note though that although parsing HTML should be a simple application
>of a generic SGML (not XML, for that you need XHTML) parser, unless
>you generate the HTML files yourself, it isn't:  There is all kind of
>crap out there in the wild, that purports to be HTML (i.e. conforming
>to some HTML DTD),
Wrong. DTD implies either SGML or XML. HTML is not expressible in
either.
> and through heroic efforts on the part of the
>browser writers does render OK.  So unless you can afford to tell 99%
>percent of your users "Error: This is not HTML!", you'll need to handle
>all sorts of erroneous stuff, like mis-matched start end end tags,
>etc. 
Bullshit. None of this is erroneous. From the perspective of
SGML or XML it is, but from the perspective of HTML it is perfectly legal.
> Generally parsing real-world HTML is probably akin to heuristic
>natural language parsing, than it is to SGML parsing.
>
If you have an XML parser and not a real HTML parser, then you
probably want to go to w3.org. There is a utility there that
translates from HTML to XHTML.
From: Pierre R. Mai
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <87lmej38wv.fsf@orion.bln.pmsf.de>
······@interaccess.com (Thaddeus L Olczyk) writes:

> On 27 Jan 2002 17:07:25 +0100, "Pierre R. Mai" <····@acm.org> wrote:
> 
> >Alexey Dejneka <········@comail.ru> writes:
> >
> >> ·······@Yahoo.Com writes:
> >> 
> >> > Does anybody know of any public-domain or other freely
> >> > available Common-LISP sourcecode for parsing HTML files
> >> 
> >> http://opensource.franz.com/xmlutils
> >
> >Note though that although parsing HTML should be a simple application
> >of a generic SGML (not XML, for that you need XHTML) parser, unless
> >you generate the HTML files yourself, it isn't:  There is all kind of
> >crap out there in the wild, that purports to be HTML (i.e. conforming
> >to some HTML DTD),
> Wrong DTD implies either SGML of XML. HTML is not expressible in
> either.

Did you actually look at the relevant W3C recommendations and RFCs?

First of all:  I never claimed that DTD "implies" SGML, because that
would be utter rubbish.  It is just a fact that since 1994 (at least)
HTML is defined as an SGML application, with reference to an SGML DTD.
To quote from RFC 1866:

<quote>
   HTML is an application of ISO Standard 8879:1986, "Information
   Processing Text and Office Systems; Standard Generalized Markup
   Language" (SGML). The HTML Document Type Definition (DTD) is a formal
   definition of the HTML syntax in terms of SGML.

   This specification also defines HTML as an Internet Media
   Type[IMEDIA] and MIME Content Type[MIME] called `text/html'. As such,
   it defines the semantics of the HTML syntax and how that syntax
   should be interpreted by user agents.
</quote>

<quote>
   A document is a conforming HTML document if:

        * It is a conforming SGML document, and it conforms to the
        HTML DTD (see 9.1, "HTML DTD").

  [...]
</quote>

Similar language can be found in all relevant W3C recommendations,
too.

So to restate the obvious: _Valid_ HTML should be parsable with any
conforming SGML parser.  You can't parse valid HTML with an XML
parser, because HTML makes use of features of SGML which are not
present in XML.  However, using an SGML parser, it is fairly simple to
convert valid HTML to valid XHTML, which can be parsed with any
conforming XML parser.

But that isn't the issue here at all:  "HTML" as found on the WWW
very seldom conforms to any kind of standard DTD (and isn't even
well-formed in the XML sense):

> >all sorts of erroneous stuff, like mis-matched start end end tags,
> >etc. 
> Bullshit. None of this is erroneous. From the perspective of
> SGML or XML it is. From the perspective of HTML it is perfectly legal.

So you consider e.g.

<b><em>mismatch here:</b></em>

to be a perfectly legal HTML document fragment[1]?  Since none of the W3C
recommendations I'm aware of, nor RFC 1866 declares this to be valid,
you can surely point me to a specification for "HTML" that does.

> > Generally parsing real-world HTML is probably akin to heuristic
> >natural language parsing, than it is to SGML parsing.
> >
> If you have an XML parser and not a real HTML parser, then you 
> probably would want to go to w3.org . There is a utility there that
> translates from HTML to XHTML.

That is a completely different problem, see above.

All in all I'd be very happy if you could tone down your language, and
give references which back your wild claims, instead.

Regs, Pierre.

Footnotes: 
[1]  And yes, you will encounter lots of that kind of stuff in
     user-written HTML (and even some that's generated by borken HTML
     editors/generators).  While the general situation has improved
     somewhat in recent years, it is still abysmal.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Bijan Parsia
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <Pine.A41.4.21L1.0201290909160.32478-100000@login7.isis.unc.edu>
On Sun, 27 Jan 2002, Thaddeus L Olczyk wrote:

> On 27 Jan 2002 17:07:25 +0100, "Pierre R. Mai" <····@acm.org> wrote:
[snip]
> >Note though that although parsing HTML should be a simple application
> >of a generic SGML (not XML, for that you need XHTML) parser, unless
> >you generate the HTML files yourself, it isn't:  There is all kind of
> >crap out there in the wild, that purports to be HTML (i.e. conforming
> >to some HTML DTD),
> Wrong DTD implies either SGML of XML. HTML is not expressible in
> either.

Huh? Do you mean "HTML files as they are really found in nature"? After
all, not only *can* HTML be modeled by a DTD, it *is* so modeled and
always has been. See the specs.

Syntax only, of course. DTDs don't say a thing about semantics.

> > and through heroic efforts on the part of the
> >browser writers does render OK.  So unless you can afford to tell 99%
> >percent of your users "Error: This is not HTML!", you'll need to handle
> >all sorts of erroneous stuff, like mis-matched start end end tags,
> >etc. 
> Bullshit. None of this is erroneous. From the perspective of
> SGML or XML it is. From the perspective of HTML it is perfectly legal.

No, you're seriously confused. Sorry.

The only thing that determines what's *legal* HTML are the specs. See my
other post for an analysis of this example.

Note that SGML allows for omitting certain close tags...unlike XML. SGML
is way more permissive.

> > Generally parsing real-world HTML is probably akin to heuristic
> >natural language parsing, than it is to SGML parsing.
> 
> If you have an XML parser and not a real HTML parser, then you 
> probably would want to go to w3.org . There is a utility there that
> translates from HTML to XHTML.

HTML Tidy, as I mentioned. It also cleans up regular HTML without
converting.

Hmm. Maybe you're confusing a "real" HTML parser with an HTML parser with
excellent error handling? An SGML parser that can read in the HTML DTD (of
whichever version) is as real an HTML parser as you can get. If the HTML
doesn't validate, it's not legal SGML *or* HTML (HTML being an SGML
application). That doesn't mean it's not *usable* HTML, or that a browser
should just throw an error.

Cheers,
Bijan Parsia.
From: Michael Hudson
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <uu1t631g6.fsf@python.net>
"Pierre R. Mai" <····@acm.org> writes:

> Alexey Dejneka <········@comail.ru> writes:
> 
> > ·······@Yahoo.Com writes:
> > 
> > > Does anybody know of any public-domain or other freely
> > > available Common-LISP sourcecode for parsing HTML files
> > 
> > http://opensource.franz.com/xmlutils
> 
> Note though that although parsing HTML should be a simple application
> of a generic SGML (not XML, for that you need XHTML) parser, unless
> you generate the HTML files yourself, it isn't:  There is all kind of
> crap out there in the wild, that purports to be HTML (i.e. conforming
> to some HTML DTD), and through heroic efforts on the part of the
> browser writers does render OK.  So unless you can afford to tell 99%
> percent of your users "Error: This is not HTML!", you'll need to handle
> all sorts of erroneous stuff, like mis-matched start end end tags,
> etc.  Generally parsing real-world HTML is probably akin to heuristic
> natural language parsing, than it is to SGML parsing.

However, I believe stuffing crap-du-jour html found on the web through
tidy (http://tidy.sf.net) usually results in something an SGML parser
won't choke on.

Cheers,
M.

-- 
  In that case I suggest that to get the correct image you look at
  the screen from inside the monitor whilst standing on your head.  
               -- James Bonfield, http://www.ioccc.org/2000/rince.hint
From: Bijan Parsia
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <Pine.A41.4.21L1.0201290901190.32478-100000@login7.isis.unc.edu>
On 27 Jan 2002, Pierre R. Mai wrote:

[snip]
> Note though that although parsing HTML should be a simple application
> of a generic SGML (not XML, for that you need XHTML) parser, unless
> you generate the HTML files yourself, it isn't:  There is all kind of
> crap out there in the wild, that purports to be HTML (i.e. conforming
> to some HTML DTD), and through heroic efforts on the part of the
> browser writers does render OK.

I think it's a little more complicated. I.e., the standard behavior of
"ignoring tags you don't understand" (an SGML trope, I believe) conceals a
multitude of sins, at least for viewing purposes. Plus, the browser parsers
*introduced* several pervasive bugs (Netscape comment handling is an
example; I forget for which versions).

>  So unless you can afford to tell 99%
> percent of your users "Error: This is not HTML!", you'll need to handle
> all sorts of erroneous stuff, like mis-matched start end end tags,
> etc.  

Actually, for a fair number of cases, that's not so bad, because of
implied end tags. E.g.:

	<li>Bad end tag? </il>
	<li>No problem!</il>
	<li> a standard SGML parser will insert a proper "</li>" before the
erroneous one and then ignore the erroneous one.
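(That repair rule is easy to express in Lisp. A sketch for LI only; the token shapes ((:start tag) / (:end tag)) are hypothetical, not from any real parser:)

```lisp
(defun insert-implied-ends (tokens)
  "Insert an implied (:end li) when a new <li> opens while one is open,
and drop mismatched end tags such as </il>."
  (let ((open '())                      ; stack of currently open tag names
        (out  '()))
    (dolist (tok tokens (nreverse out))
      (cond ((and (consp tok) (eq (first tok) :start))
             ;; A new <li> while an <li> is open: emit the implied </li>.
             (when (and (eq (second tok) 'li) (eq (first open) 'li))
               (push '(:end li) out)
               (pop open))
             (push (second tok) open)
             (push tok out))
            ((and (consp tok) (eq (first tok) :end))
             (if (eq (second tok) (first open))
                 (progn (pop open) (push tok out))
                 ;; Mismatched end tag such as </il>: ignore it.
                 nil))
            (t (push tok out))))))

;; (insert-implied-ends '((:start li) "Bad end tag? " (:end il)
;;                        (:start li) "No problem!" (:end il)))
;; => ((:start li) "Bad end tag? " (:end li)
;;     (:start li) "No problem!")
```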

>Generally parsing real-world HTML is probably akin to heuristic
> natural language parsing, 

Not for viewing. For extracting data (screen scraping), sure.

> than it is to SGML parsing.

SGML parsing can be kinda nasty :)

For an excellent example of how to handle HTML at browser-level
(hahahah!) quality, check out HTML Tidy (from the W3C). It basically
implements the kind of error handling the major browsers do and tries to
generate clean HTML or XHTML. The code is rather readable, and it has both
a C and a Java version. It even has modes to help clean up Microsoft
generate "HTML".

Cheers,
Bijan Parsia.
From: Brian P Templeton
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <87d6zpesu1.fsf@tunes.org>
Bijan Parsia <·······@email.unc.edu> writes:

> On 27 Jan 2002, Pierre R. Mai wrote:
> 
> [snip]
>> Note though that although parsing HTML should be a simple application
>> of a generic SGML (not XML, for that you need XHTML) parser, unless
>> you generate the HTML files yourself, it isn't:  There is all kind of
>> crap out there in the wild, that purports to be HTML (i.e. conforming
>> to some HTML DTD), and through heroic efforts on the part of the
>> browser writers does render OK.
> 
> I think it's a little more complicated. I.e., the standard behavior of
> "ignoring tags you don't understand" (a SGML trope, I believe) conceals a
> multitude of sins, at least for viewing purpose. Plus, the browser parsers
> *introduced* several pervasive bugs (Netscape comment handling, is an
> example, I forget for which versions).
> 
>>  So unless you can afford to tell 99%
>> percent of your users "Error: This is not HTML!", you'll need to handle
>> all sorts of erroneous stuff, like mis-matched start end end tags,
>> etc.  
> 
> Actually, for a fair number of cases, that's not so bad, because of
> implied endtags. E.g.:
> 
> 	<li>Bad end tag? </il>
> 	<li>No problem!</il>
> 	<li> standard sgml will insert a proper "</li>" before the
> erronous one and then ignore the erroneous one.
> 
>>Generally parsing real-world HTML is probably akin to heuristic
>> natural language parsing, 
> 
> Not for viewing. For extracting data (screen scraping), sure.
> 
>> than it is to SGML parsing.
> 
> SGML parsing can be kinda nasty :)
> 
> For an excellent example of how to handle HTML at browser level
> (hahahah!) quality, check out HTML Tidy (from the W3C). It basically
> implemenets the kind of error handling the major browsers do and tries to
> generate clean HTML or XHTML. The code is rather readable, and has both a
> C and a Java version. It even has modes to help clean up Microsoft
> generate "HTML".
  ^^^^^^^^
Surely you mean *de*generate here :)

> Cheers,
> Bijan Parsia.
> 

-- 
BPT <···@tunes.org>	    		/"\ ASCII Ribbon Campaign
backronym for Linux:			\ / No HTML or RTF in mail
	Linux Is Not Unix			 X  No MS-Word in mail
Meme plague ;)   --------->		/ \ Respect Open Standards
From: Paolo Amoroso
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <hfZTPLJ+Hc36h8d0rjQMnFrShzIq@4ax.com>
On 26 Jan 2002 12:07:42 -0800, ·······@Yahoo.Com wrote:

> Does anybody know of any public-domain or other freely
> available Common-LISP sourcecode for parsing HTML files into
> a structured form, for example <bold>text</bold> should

Check `cllib', which is part of CLOCC (Common Lisp Open Code Collection):

  http://clocc.sourceforge.net/

It comes with some HTML parsing code. Also see the Web section of CLiki...

> More generally, does there exist any archive of
> freely-available Common-LISP sourcecode which is indexed by

The closest thing is CLiki:

  http://www.telent.net/cliki


> If a fully indexed LISP-source archive with search engine
> isn't available, then what about just a LISP-source archive
> with an index by keywords or topics that is easily readable
> by a LISP program so that I could write the search engine
> myself?

There is the CMU Common Lisp Repository, but it is no longer being
maintained:

  ftp://ftp.cs.cmu.edu/user/ai/lang/lisp/


Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://web.mclink.it/amoroso/ency/README
[http://cvs2.cons.org:8000/cmucl/doc/EncyCMUCLopedia/]
From: ·······@inetmi.com
Subject: Re: Free CL sourcecode to parse HTML, RFC822 headers, etc.?
Date: 
Message-ID: <uheori7d0.fsf@chicago.inetmi.com>
·······@Yahoo.Com writes:

> Does anybody know of any public-domain or other freely
> available Common-LISP sourcecode for parsing HTML files into
> a structured form, for example <bold>text</bold> should
> parse as something like (bold "text") instead of as
> (start-bold)"text"(end-bold), and <a href="url">text</a>
> should parse as something like (href "url" "text") instead
> of as (start-href "url") "text" (end-href) likewise all
> other semantically nested meaning should be represented as
> nested s-expressions rather than sequences of start and end
> forms as the original HTML shows?

Check out Franz's open source HTML parser, at
<http://opensource.franz.com/xmlutils/index.html>.  It does at least
try to handle some illegal HTML.
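(A quick usage sketch; the function and package names and the output shape below are from my memory of the phtml library, so verify them against the page above before relying on them:)

```lisp
;; Hedged sketch: with Franz's phtml loaded, PARSE-HTML returns a
;; nested "LHTML" list, roughly the form the original poster asked for.
;; Package/function names and result shape are assumptions, not spec.
(net.html.parser:parse-html "<b>text</b>")
;; => something like ((:b "text")); elements with attributes come back
;;    as ((:a :href "url") "text") for <a href="url">text</a>.
```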


John Wiseman