From: Robert Maas, see http://tinyurl.com/uh3t
Subject: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <rem-2007apr20-001@yahoo.com>
For years I've had needs for parsing HTML, but avoided writing a
full HTML parser because I thought it'd be too much work. So
instead I wrote various hacks that gleaned particular data from
special formats of HTML files (such as Yahoo! Mail folders and
individual messages) while ignoring the bulk of the HTML file.

But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all flaky and bug-ridden, I finally decided to
<cliche>bite the bullet</cliche> and write a genuine HTML parser.

Yesterday (Wednesday) I started work on the tokenizer, using one of
my small Web pages from years ago as the test data:
<http://www.rawbw.com/~rem/WAP.html>
As I was using TDD (Test-Driven Development), I discovered that the
file was still using the *wrong* syntax <p /> to make blank lines
between parts of the text, so I changed those to valid code so that
my HTML tokenizer would work successfully on the file; I had it
finished to that point last night.

Then I switched to using the Google-Group Advanced-Search Web page
as test data, and finally got the tokenizer working for it after a
few more hours' work today (Thursday).

Then I wrote the routine to take the list of tokens and find all
matching pairs of open tag and closing tag, replacing them with a
single container cell that included everything between the tags.
For example (:TAG "font" ...) (:TEXT "hello") (:INPUT ...) (:ENDTAG "font")
would be replaced by (:CONTAIN "font" (...) ((:TEXT "hello") (:INPUT ...))).
I single-stepped it at the level of full collapses, all the way to
the end of the test file, so I could watch it and get a feel for
what was happening. It worked perfectly the first time, but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> that were opened but never closed, and
also lots of <p> <p> <p> that weren't closed either. Even some
unclosed elements of tables.
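
For the curious, here is a minimal sketch of that kind of pair-matching
pass, using the token shapes from the example above. This is
illustrative only, not the actual code; unmatched opening tags are
simply left in place as :TAG tokens, and mismatched nesting is not
handled.

(defun find-close-p (name tokens)
  "True if a matching (:ENDTAG NAME) token appears anywhere in TOKENS."
  (member-if (lambda (tok)
               (and (consp tok) (eq (first tok) :endtag)
                    (string-equal (second tok) name)))
             tokens))

(defun collapse-until (tokens stop-name)
  "Collapse TOKENS until a matching (:ENDTAG STOP-NAME) is seen.
Returns two values: the collapsed children and the unconsumed rest."
  (let ((out '()))
    (loop
      (let ((tok (first tokens)))
        (cond ((null tokens)
               (return (values (nreverse out) nil)))
              ((and (consp tok) (eq (first tok) :endtag)
                    (string-equal (second tok) stop-name))
               (return (values (nreverse out) (rest tokens))))
              ((and (consp tok) (eq (first tok) :tag)
                    (find-close-p (second tok) (rest tokens)))
               ;; Open tag with a matching close somewhere ahead:
               ;; collapse everything in between into a :CONTAIN cell.
               (multiple-value-bind (children rest)
                   (collapse-until (rest tokens) (second tok))
                 (push (list :contain (second tok) (third tok) children)
                       out)
                 (setf tokens rest)))
              (t
               ;; Text, <br>, unmatched opens, etc. pass through as-is.
               (push tok out)
               (setf tokens (rest tokens))))))))

(defun collapse-tags (tokens)
  "Collapse matching open/close tag pairs into :CONTAIN cells."
  (values (collapse-until tokens nil)))

;; e.g. (collapse-tags '((:tag "font" nil) (:text "hello") (:endtag "font")))
;;      => ((:contain "font" nil ((:text "hello"))))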

Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
<http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
Any place you see a :TAG, that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the w3c validator says about the HTML?
<http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fadvanced_group_search%3Fhl%3Den>
   Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors are because the doctype
declaration claims it's XHTML transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder, if all the tags were changed to lower case, how many
fewer errors would show up in the w3c validator? Modified GG page:
<http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
<http://validator.w3.org/check?uri=http%3A%2F%2Fwww.rawbw.com%2F%7Erem%2FNewPub%2Ftmp-ggadv.html>
   Result: Failed validation, 693 errors
Hmmm, this validation error concerns me:
   145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
       NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire Web-page source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.

From: Tim Bradshaw
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177063028.363207.70770@q75g2000hsh.googlegroups.com>
On Apr 20, 8:48 am, ·······@yahoo.com (Robert Maas, see http://tinyurl.com/uh3t)
wrote:
> but I saw
> an awful lot of bad HTML in the Google-Groups Advanced-Search page,
> such as many <b> and <font> that were opened but never closed, and
> also lots of <p> <p> <p> that weren't closed either. Even some
> unclosed elements of tables.

Depending on the version of HTML (on the DTD in use) omitted closing
tags may be perfectly legal.  SGML has many options to allow omission
of tags, both closing and opening.  This is one of the things that XML
did away with, as it makes it impossible to build a parse tree for the
document unless you know the DTD.  So obviously they are not omissible
for any document claiming to be XHTML, I think.

P, for instance, has an omissible close tag in HTML 4.01.

--tim
From: Toby A Inkster
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <j7bmf4-hco.ln1@ophelia.g5n.co.uk>
Robert Maas, see http://tinyurl.com/uh3t wrote:

> But since I have a whole bunch of current needs for parsing various
> kinds of HTML files, and I don't want to have to write a separate
> hack for each format, all flakey/bugridden, I finally decided to
> <cliche>bite the bullet</cliche> and write a genuine HTML parser.

Congratulations. Real parsers are fun.

But wouldn't it have been a bit easier to reuse one of the many existing
parsers? e.g. http://opensource.franz.com/xmlutils/xmlutils-dist/phtml.htm

-- 
Toby A Inkster BSc (Hons) ARCS
http://tobyinkster.co.uk/
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
From: Robert Uhl
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <m3ejmewyzr.fsf@latakia.dyndns.org>
Toby A Inkster <············@tobyinkster.co.uk> writes:
>
> But wouldn't it have been a bit easier to reuse one of the many
> existing parsers?
> e.g. http://opensource.franz.com/xmlutils/xmlutils-dist/phtml.htm

Not as much fun as writing one's own, no doubt.  Plus, depending on the
program which consumes the parser's output, it might be too difficult to
massage another parser's output.

-- 
Robert Uhl <http://public.xdi.org/=ruhl>
If we were in a pub, I would have bought you a pint to
commiserate, and _then_ laughed at you.
                      --Stephen Harris
From: John Thingstad
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <op.tq2zuoh5pqzri1@pandora.upc.no>
On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see  
http://tinyurl.com/uh3t <·······@yahoo.com> wrote:

>
> Anyway, after spending an hour single-stepping it all, and finding
> it working perfectly, I had a DOM (Document Object Model)
> structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
> then of course I prettyprinted it to disk. Have a look if you're
> curious:
> <http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
> Any place you see a :TAG that means an opening tag without any
> matching close tag. For <br>, and for the various <option> inside a
> <select>, that's perfectly correct. But for the other stuff I
> mentionned such as <b> and <font> that isn't valid HTML and never
> was, right? I wonder what the w3c validator says about the HTML?
> <http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fadvanced_group_search%3Fhl%3Den>
>    Result: Failed validation, 707 errors
> No kidding!!! Over seven hundred mistakes in a one-page document!!!
> It's amazing my parser actually parses it successfully!!
> Actually, to be fair, many of the errors are because the doctype
> declaraction claims it's XHTML transitional, which requires
> lower-case tags, but in fact most tags are upper case. (And my
> parser is case-insensitive, and *only* parses, doesn't validate at
> all.) I wonder if all the tags were changed to lower case, how
> fewer errors would show up in w3c validator? Modified GG page:
> <http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
> <http://validator.w3.org/check?uri=http%3A%2F%2Fwww.rawbw.com%2F%7Erem%2FNewPub%2Ftmp-ggadv.html>
>    Result: Failed validation, 693 errors
> Hmmm, this validation error concerns me:
>    145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
>        NO was specified.
> My guess is some smartypants at Google thought it'd make good P.R.
> to declare the document as XHTML instead of HTML, without realizing
> that the document wasn't valid XHTML at all, and the DTD used was
> totally inappropriate for this document. Does anybody know, from
> eyeballing the entire WebPage source, which DOCTYPE/DTD
> declaraction would be appropriate to make it almost pass
> validation? I bet, with the correct DOCTYPE declaraction, there'd
> be only fifty or a hundred validation errors, mostly the kind I
> mentionned earlier which I discovered when testing my new parser.

As an ex-employee of Opera I can say that writing a Web browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it, it could be simple
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
SGML is more difficult to parse. Then there is the fact that many
sites rely on errors in the HTML being handled just like in
Microsoft Internet Explorer. I can't count the number of times I heard that Opera
was broken, only to find that it was an HTML error on the web site that
Explorer got around.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Thomas F. Burdick
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177074185.233559.24140@p77g2000hsh.googlegroups.com>
On Apr 20, 2:05 pm, "John Thingstad" <··············@chello.no> wrote:

> As a ex employee of Opera I can say that writing a Web Browser is hard!
> It is not so much the parsing of correct HTML as the parsing of incorrect
> HTML that poses the problem. Let's face it. It could be simple.
> If we all used XHTML and the browser aborted with a error message
> when a error occurred. Unfortunately that is hardly the case.

This is unfortunate why?  Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up?  Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.
From: dpapathanasiou
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177081954.773575.230030@d57g2000hsg.googlegroups.com>
> This is unfortunate why?  Because of the high correlation between
> people who have something to say worth reading and those who can write
> XML without screwing it up?  Face it, HTML is a markup language
> historically created directly by humans, which means you *will* get
> good content with syntax errors by authors who will not fix it.

But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.
From: Thomas A. Russ
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <ymik5w7yor4.fsf@sevak.isi.edu>
dpapathanasiou <···················@gmail.com> writes:

> > This is unfortunate why?  Because of the high correlation between
> > people who have something to say worth reading and those who can write
> > XML without screwing it up?  Face it, HTML is a markup language
> > historically created directly by humans, which means you *will* get
> > good content with syntax errors by authors who will not fix it.
> 
> But this problem was entirely preventable: if Netscape and early
> versions of IE had rejected incorrectly-formatted html, both people
> hacking raw markup and web authoring tools would have learned to
> comply with the spec, and parsing html would not be the nightmare it
> is today.

On the other hand, it could also be argued that, especially early on,
before web authoring tools existed, such laxity contributed to the
widespread adoption of html.  Making the renderer not particularly
picky about the input made it easier for authors to hand-create
html pages without the frustration of having things get rejected and not
appear at all.  

That provided a nicer development environment (somewhat reminiscent of
Lisp environments), where things would work, even if not every part of
the document were well-formed and correct.  The author could then go
back and fix the places that didn't work.  That would be true even if
correct rendering were strict, but I do think that laxness in
enforcement of the standards helped the spread of html in the early
days.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Robert Maas, see http://tinyurl.com/uh3t
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <rem-2007apr22-001@yahoo.com>
> From: ····@sevak.isi.edu (Thomas A. Russ)
> it could also be argued that, especially early on, before web
> authoring tools existed, such laxity contributed to the
> widespread adoption of html.  By making the renderer not
> particularly picky about the input, it made it easier for authors
> to hand create the html pages without the frustration of having
> things get rejected and not appear at all.

That part is fine, but what you say next isn't quite right...

> That provided a nicer development environment (somewhat
> reminiscent of Lisp environments), where things would work, even
> if not every part of the document were well-formed and correct.

There are two major aspects of Lisp environments, only one of which
is present in an HTML-coding/viewing environment:
- Tolerance of mistakes: one mistake doesn't abort compilation and
   cause totally null output except for compiler diagnostics. TRUE
- Interactive R-E-P loop whereby you can instantly see the result
   of each line of code as you write it, and after a mistake
   immediately modify your attempt until you get it right before
   moving on to the next line of code. NO!!
The interactive model for any Web-based service (HTML, CGI, PHP,
etc.) is very different from Lisp's (or Java's BeanShell's) R-E-P.
Web-based services always deal with an entire application, even if
some parts aren't yet written, either missing or stubbed to show
where they *would* be. The entire application is always re-started
each time you try one new line of code, and you must manually
search to the bottom of the output to see where it is, which is
more work for the human visual processing system than just watching
the R-E-P scroll by where your input is immediately followed by the
corresponding output. (And if you insert new code *between*
existing code, then it's even more effort to scroll to where the
new effect should be located to see how it looks when rendered.)

As an example of this difference without changing languages, I
write both R-E-P applications and CGI applications using Common
Lisp. Whenever I am writing a CGI application, it's a lot more
hassle, because of the totally different debugging environment. I'm
constantly fighting it somehow. Depending on the application, or
which part of the application I'm writing, I use one of two
strategies:
- If I'm writing a totally simple application, I copy an old CGI
   launchpad (a .cgi file which does nothing except invoke CMUCL
   with appropriate Lisp file to load) and change the name of the
   file (the name of Lisp file to load), then I create a dummy Lisp
   file which does nothing except make the call to my library
   routine to generate CGI/MIME header for either TEXT/PLAIN or
   TEXT/HTML, and print a banner, and exit. Then I immediately
   start the Web browser to make sure I have at least that "hello
   world" piece of trivia working at all before I go on. Then I add
   one new line of code at a time and immediately re-load the Web
   page, to force re-execution of the entire program up to that
   point, and scroll if necessary to bring the result of the new
   line of code on-screen. I include a lot of FORTRAN-style
   debugging printouts to explicitly show me the result of each new
   line of code even if that result wouldn't normally be shown
   during normal running of the finished application. After a few
   more lines of code have been added and FORTRAN-style
   debug-printed out, I start to comment out some of the early
   debug-print statements that I'm tired of seeing over and over.
   This necessity to add a print statement for virtually every new
   line of code is a big nuisance compared to the R-E-P loop where
   that always happens by default, and commenting out the print
   statements late is a nuisance compared to the R-E-P loop where
   old printout simply scrolls off screen all by itself.
- Whenever I write a significant D/P (data-processing) algorithm,
   to avoid the hassle described above, I usually develop the
   entire algorithm in the normal R-E-P loop, then port it to CGI
   using the above technique at the very end, so only the interface
   from CGI and the toplevel calls to various algorithms need be
   debugged in the CGI environment with FORTRAN-style print
   statements etc. If the algorithm needs the results from an HTML
   form, sometimes I first write a dummy application which does
   nothing except call the library function to decode the
   urlencoded form contents, then print the association list to
   screen. Then I run it once, copy the association list from
   screen and paste into the R-E-P debug environment. Then after
   the algorithm using that data has been completely debugged, I
   splice a call to the algorithm back into the CGI application and
   finish up debugging there.
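
To make the first strategy concrete, here is a minimal sketch of such a
"hello world" CGI stub in Common Lisp.  The file name and function name
are hypothetical, not the actual code; the .cgi launchpad itself is just
a tiny script that starts CMUCL and loads this file, and the exit call
is CMUCL-specific.

;;; hello.lisp -- illustrative stub only.
(defun cgi-hello-world ()
  ;; CGI/MIME header first: the content type, then the mandatory blank line.
  (format t "Content-Type: text/plain~%~%")
  ;; A banner, so reloading the page shows the stub is alive at all.
  (format t "hello world from the Lisp CGI stub~%")
  ;; FORTRAN-style debug printouts for each newly added line of code would
  ;; go here, e.g. (format t "DEBUG: x = ~S~%" x), and get commented out
  ;; once they become tiresome.
  (finish-output))

(cgi-hello-world)
;; For the second strategy, a similar stub would do nothing but decode the
;; urlencoded form contents and print the association list, so it can be
;; copied from the screen and pasted into the R-E-P debug environment.
#+cmu (ext:quit)
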
The point is that debugging in a Web-refresh-to-restart-whole-program
environment is so painful compared to R-E-P that I avoid it as much
as possible. But with HTML (or PHP), there's no alternative. There
simply is no way, that I know of anyway, to develop new code in an
R-E-P sort of environment.

Now to be fair, in HTML nearly *every* line of code written (not
counting stylesheets, which are recent compared to "early HTML"
discussion here) produces some visual effect which is physically
located in the same relationships to other visual effects as the
physical relationship of the corresponding source (HTML) code. So
we never have to add extra "print" statements and later comment
them out. At most we might sometimes have to add extra visual
characters around white space just to show whether the white space
is really there, since white space at end of line doesn't show
visually. But still, the need to type input in one window and then
switch to another window and deliberately invoke a page-reload
command and then wait for a network transaction (even if working on
local server) before seeing the result, and *not* seeing the source
code and visual effect together on one screen where the eye can
dart back and forth to spot what mistake in source caused the bad
output, is a significant pain during development, so your glib
comparison between HTML code development and Lisp R-E-P code
development just doesn't hold.

Now if somebody could figure out a way to "block" pieces of HTML
code so that it would be possible for a development environment to
alternate showing source code and rendered output within a single
window, and in fact the programmer could type the source directly
onto this intermix-window, either typing a new block of code at the
bottom, or editing an old block of code, that would make it like
the Lisp R-E-P. But then since HTML is primarily a visual-effect
language, and what is really being debugged is the way text looks
nice laid out on a page, the interspersed source would ruin the
visual effect and in some ways make debugging more difficult. So
maybe instead it could use a variation of the idea whereby the main
display screen shows exactly the rendered output, but aside it is
the source screen, with blocks of code mapped to blocks of
presentation via connecting brackets, somewhat like this:

          PRESENTATION                   SOURCE
                         ----+   +----
    Hi, this is a paragraph  |   |   <p>Hi,
    of rendered text, all    +---+   this is a paragraph of rendered text,
    nicely aligned. I wonder |   |   all nicely aligned.
    if it will work?         |   |   I wonder if it will work?</p>
                         ----+   +----

But of course, although that might help today's HTML authors if
somebody created such a tool, no such tool existed back in the
early days we're talking about here, so my argument about the pain
of HTML coding compared to Lisp R-E-P stands.

(Also, it may be difficult to work with tables using the idea of
sequential blocks of HTML source; in fact the whole idea may be
useless for such "interesting" (in the Chinese sense) coding.)

> The author could then go back and fix the places that didn't
> work.

Which is rather different from Lisp R-E-P development, where you
hardly ever have to go *back* to fix stuff that didn't work, rather
you fix it immediately while it's still the latest thing you wrote.
If you try to write a whole bunch of Lisp code without bothering to
test each part individually, and *then* you try to run the whole
mess, what happens is similar to what happens when programming in
C, the very first thing that bombs out causes nothing after it to
be properly tested at all. This is a significant difference between
HTML (and other formatting languages, where the various parts of
the script are rather independent) and any programming language
where later processing is heavily dependent on earlier results.

Now for a real bear, try PHP: It works *only* in a Web environment,
so you can't try it in an interactive environment as you could with
Lisp or Perl, but it's a true programming language, where later
processing steps are heavily dependent on early results, so you
can't just throw together a lot of stuff (as with HTML) and debug
all the independent pieces in any sequence you want. You are
essentially forced to use that painful style of development I
described as the first (least preferred) style of CGI programming.

Back to the main topic: One thing, for the early days, which might
have bridged the gap between sloppy first-cut HTML where the
browser guesses what you really meant (and different browsers guess
differently) and good HTML, would be a way of switching "pedantic"
mode off and on. But hardly any C programmers ever use the pedantic
mode, so why should we expect HTML authors to do so either??

The bottom line is that there's a conflict between ease of
first-cut authoring that made HTML so popular in the early days,
and strict following of the specs to make proper HTML source, and I
don't see any easy solution. Maybe the validation services (such as
W3C provides), together with a "hall of shame" for the worst
offenders at HTML that grossly fails validation, would coerce some
decent fraction of authors to eventually fix their original HTML to
become proper HTML?? (Or maybe Google could do validation on all
Web sites it indexes, and demote any site that fails validation, so
it doesn't show up in the first page of search results, and the
more severely a Web page fails validation the further down the
search results it's forced? If Google can fight the government of
the USA regarding invasion of privacy of users, maybe they can try
my idea here too?? Google *is* the 800 pound gorilla of the Web,
and if they applied reward/punishment to good/bad Web authors, I
think it would have a definite effect. Unfortunately, Google is one
of the worst offenders, as I noted the other day. Never mind...)

Anybody want to join me in building a Hall of Shame for HTML
authors, starting with Google's grossly bad HTML (declared as
transitional XHTML which is totally bogus, ain't even close to
XHTML)?
From: Andy Dingley
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177090474.933189.303240@d57g2000hsg.googlegroups.com>
On 20 Apr, 16:12, dpapathanasiou <···················@gmail.com>
wrote:

> But this problem was entirely preventable: if Netscape and early
> versions of IE had rejected incorrectly-formatted html, both people
> hacking raw markup and web authoring tools would have learned to
> comply with the spec,

We'd also still be using HTML 1.0, as the legacy problems would stifle
any change to the standard.
From: Paul Wallich
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <f0b6ds$qf0$1@reader2.panix.com>
Andy Dingley wrote:
> On 20 Apr, 16:12, dpapathanasiou <···················@gmail.com>
> wrote:
> 
>> But this problem was entirely preventable: if Netscape and early
>> versions of IE had rejected incorrectly-formatted html, both people
>> hacking raw markup and web authoring tools would have learned to
>> comply with the spec,
> 
> We'd also still be using HTML 1.0, as the legacy problems would stifle
> any change to the standard.

Remember that originally no one was supposed to write HTML. It was 
supposed to be produced automagically by design tools and transducers 
operating on existing formatted documents.

You know, the same way that no one is supposed to write in assembler.

paul
From: Pascal Costanza
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <58s4rtF2icrduU1@mid.individual.net>
dpapathanasiou wrote:
>> This is unfortunate why?  Because of the high correlation between
>> people who have something to say worth reading and those who can write
>> XML without screwing it up?  Face it, HTML is a markup language
>> historically created directly by humans, which means you *will* get
>> good content with syntax errors by authors who will not fix it.
> 
> But this problem was entirely preventable: if Netscape and early
> versions of IE had rejected incorrectly-formatted html, both people
> hacking raw markup and web authoring tools would have learned to
> comply with the spec, and parsing html would not be the nightmare it
> is today.

If early browsers had rejected incorrect html, the web would have never 
been that successful.

What's important to keep in mind is that those who create the content 
are end-users. It must be easy to create content, and shouldn't require 
any specific skills (or not more than absolutely necessary).

Stupid error messages from stupid technology are a hindrance, not an enabler.


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
From: Tim Bradshaw
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177083985.652662.169080@p77g2000hsh.googlegroups.com>
On Apr 20, 4:34 pm, Pascal Costanza <····@p-cos.net> wrote:
>
> If early browsers had rejected incorrect html, the web would have never
> been that successful.
>
> What's important to keep in mind is that those who create the content
> are end-users. It must be easy to create content, and shouldn't require
> any specific skills (or not more than absolutely necessary).
>
> Stupid error messages from stupid technology is a hindrance, not an enabler.

Well said.
From: dorayme
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <doraymeRidThis-30209A.10450121042007@news-vip.optusnet.com.au>
In article 
<························@d57g2000hsg.googlegroups.com>,
 dpapathanasiou <···················@gmail.com> wrote:

> > This is unfortunate why?  Because of the high correlation between
> > people who have something to say worth reading and those who can write
> > XML without screwing it up?  Face it, HTML is a markup language
> > historically created directly by humans, which means you *will* get
> > good content with syntax errors by authors who will not fix it.
> 
> But this problem was entirely preventable: if Netscape and early
> versions of IE had rejected incorrectly-formatted html, both people
> hacking raw markup and web authoring tools would have learned to
> comply with the spec, and parsing html would not be the nightmare it
> is today.

It's a nice fantasy that a zero tolerance policy would work. Face 
it, someone would bring out a competitor that tolerated faults 
and everyone would rush to use that one instead.

-- 
dorayme
From: mbstevens
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <pan.2007.04.21.00.52.34.76203@xmbstevensx.com>
On Sat, 21 Apr 2007 10:45:01 +1000, dorayme wrote:

> In article 
> <························@d57g2000hsg.googlegroups.com>,
>  dpapathanasiou <···················@gmail.com> wrote:
> 
>> > This is unfortunate why?  Because of the high correlation between
>> > people who have something to say worth reading and those who can write
>> > XML without screwing it up?  Face it, HTML is a markup language
>> > historically created directly by humans, which means you *will* get
>> > good content with syntax errors by authors who will not fix it.
>> 
>> But this problem was entirely preventable: if Netscape and early
>> versions of IE had rejected incorrectly-formatted html, both people
>> hacking raw markup and web authoring tools would have learned to
>> comply with the spec, and parsing html would not be the nightmare it
>> is today.
> 
> It's a nice fantasy that a zero tolerance policy would work. Face 
> it, someone would bring out a competitor that tolerated faults 
> and everyone would rush to use that one instead.

You need to set up a switch 0-9 for how much crap code the parser will
accept.
From: Robert Uhl
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <m37is5x24f.fsf@latakia.dyndns.org>
mbstevens <·············@xmbstevensx.com> writes:
>
> You need to set up a switch 0-9 for how much crap code the parser will
> accept.

Eh, 1-12 is better--then you have more sensible fractional settings
(1/6, 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 vs 1/5, 2/5, 1/2, 3/5, 4/5).  For a
simple bogosity switch, why not just accept-bogosity-p?

-- 
Robert Uhl <http://public.xdi.org/=ruhl>
progress (n): the process through which Usenet has evolved from smart
people in front of dumb terminals to dumb people in front of smart
terminals                                --obs at burnout.demon.co.uk
From: Kent M Pitman
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <uabx1zp6c.fsf@nhplace.com>
Robert Uhl <·········@NOSPAMgmail.com> writes:

> mbstevens <·············@xmbstevensx.com> writes:
> >
> > You need to set up a switch 0-9 for how much crap code the parser will
> > accept.
> 
> Eh, 1-12 is better--then you have more sensible fractional settings
> (1/6, 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 vs 1/5, 2/5, 1/2, 3/5, 4/5).  For a
> simple bogosity switch, why not just accept-bogosity-p?

In the HTML parsers I've written, the problem isn't even writing a parser
for these things, it's the enormous and unwanted burden of resolving the
errors in a way that makes people happy.

The problem with HTML as a spec is indeed that it presupposes no
typos.  And yet the design mistake set early on was to tolerate them.
If Mosaic (the original browser) had just said 
  "Bad syntax in page.  Can't display it."
things might have turned out differently.  I think it was Mosaic that
started the trend of correcting errors, but
maybe it came later.  But my point is that the early competition in
browsers was NOT in correctness, but in tolerance.  And what resulted
was a "semantics" which is simply not described by the spec.  And that 
means "implementing it" does not mean "implementing the spec".

In later revs of the standards, I seem to recall that they finally
figured out that, for example, table layout was underspecified.  I
recall implementing table layout for a web browser I wrote in Lisp
once [*], and the issue was that the constraint relaxation in
tables had choice points that, if you didn't do them right, people would
complain you'd done it wrong because it didn't display stuff like
Netscape and/or IE did.  So finally they've started to explain how
those issues are to be resolved.  That's both good and bad, though,
because what they did was entrench the Rightness of particular
commercial endeavors and not the Rightness of correct thought.  This
isn't a grumbling about winners and losers under capitalism, this is
an observation that it's favoring, indirectly, a particular
implementation strategy over others that might be faster, smaller,
more extensible, etc.  Because those strategies might be perfectly
suitable for correct HTML, but if they can't handle junk, they are not
favored in spite of their other features.

So what makes HTML hard is that HTML does not _mean_ HTML.  In the
end, HTML means Internet Explorer or Mozilla or something... and when
people say your browser works, they mean "it works like those".  They
don't mean "it works like the spec".  One is not rewarded for
succeeding, one is rewarded for deliberately not succeeding, and for
creating a system that encourages others to do likewise.  Depending,
of course, on what you accept as the goal.

Modularizing the task into something that corrects bad HTML to good 
and something that displays good HTML is probably the way to go.
Parsers for bad HTML don't have to know about HTML "meaning", just its
structure.  Since I don't know of a public spec for what corrections
are required of browsers to not get yelled at, I don't know for sure,
but my sense is that the repair operations are not allowed to depend
on the high level semantics.  I think they're mostly about how to 
repair missing ">"'s or how to treat missing end tags or how to treat
errors in element attributes or how to treat misbalanced elements like
end tags in the wrong order.  It's possible that the special treatment
of anchors, allowing them to span from the middle of one tree to the
middle of another is a violation of what I said about not knowing the
high level semantics.  But those are the kinds of things one needs
such a preprocessor to do.

So I would think it's qualitative/discrete control one wants, not
numerical/fractional/percentage based.  You either do or do not want
such fixups. You might want to enable/disable specific ones.  But
saying a percentage is weird.  It suggests a homogeneity to the
problem that there is not, and it suggests a canonical ordering to the
fixes that says the ones at one end are "must do" and the others are
"maybe" and that throttling up or down will hit them in the right
order.  I don't see it.  What I see in such systems is laziness in
the design of the controls, or a theory of users' laziness in
controlling them so cynical that I find myself wondering why they are
allowed any control at all.

I don't see how the strangeness of HTML handling can be improved by a
numeric slider for bogosity.  Control should be designed for those who
have the presence of mind to be thoughtful, not brought on a platter
for those who don't know what they're doing to fiddle mindlessly.  
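
To make that contrast concrete, here is a tiny sketch of the kind of
qualitative control being argued for: named, individually switchable
repairs rather than a numeric knob.  The fixup names are hypothetical
and the repair functions are left as stubs.

(defparameter *fixups*
  ;; Each entry: a fixup name and a function from token-list to token-list.
  (list (cons :repair-missing-gt     #'identity)   ; stub
        (cons :close-unclosed-inline #'identity)   ; stub
        (cons :reorder-end-tags      #'identity))) ; stub

(defun repair-tokens (tokens &key (enable '(:repair-missing-gt)))
  "Apply only the fixups named in ENABLE, in registration order."
  (loop for (name . fixup) in *fixups*
        when (member name enable)
          do (setf tokens (funcall fixup tokens)))
  tokens)

;; e.g. (repair-tokens toks :enable '(:repair-missing-gt :reorder-end-tags))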

It reminds me of what I'm always saying about why I don't like
Windows XP in its default configuration.  The real controls are
papered over, showing you only a cartoon caricature.  The analogy I
always use is that of a microwave that used to have settings for heat
and time replaced by a newer model that has only two buttons: popcorn
and steak.  Yes, the controls are simpler.  No, they are not more
useful.

- - - -

[*] The web browser I refer to is unrelated to the web server I wrote
 for my own consulting company.  The browser was something I wrote 
 at Harlequin, before it folded.  Like many interesting things Harlequin
 did internally, never productized.  I don't know what ended up happening
 to that code.  It wasn't product-worthy, but it was functional at a 
 prototype level, capable of displaying early web pages with fonting,
 tables, gifs, etc.  There was no scripting back then and the HTML spec
 was smaller.  (But a lot of the growth in the HTML spec has been to 
 clarify questions everyone had to make anyway and to give users access
 to them... the basic concepts didn't really change.)

 Incidentally, I claim I was the only one (or perhaps one of a very
 few) who really did "accidentally" author a web browser, a feat
 remarked about in a Dilbert cartoon, with Ratbert commenting on
 it.  The browser was originally written as an HTML->PostScript
 rendering tool, and when I got done I said "Hey, I bet if I changed
 the back-end to use CLIM instead of PostScript, it would be a
 browser."  It turned out the only thing I had to add besides modular
 display changes was support for anchors and there it was, without
 even giving thought to wanting to build one in advance.  A true 
 accident.  I'd never have gotten the time allocated to build it 
 otherwise.

 But it wasn't a tasked project, and so it didn't have anyone saying
 it had to become finished.  What it became was more of a vehicle for
 debugging myriad CLIM bugs...  

 Ah well. Such is the fate of the best laid (non)plans of mice and men,
 I guess... with due apologies to Ratbert and Dilbert for the ill-placed
 metaphor.
From: Don Geddis
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <87k5w58kjp.fsf@geddis.org>
Kent M Pitman <······@nhplace.com> wrote on 21 Apr 2007 12:0:
> The problem with HTML as a spec is indeed that it presupposes no
> typos.  And yet the design mistake set early on was to tolerate them.
> If Mosaic (the original browser) had just said 
>   "Bad syntax in page.  Can't display it."
> I think it was Mosaic that started the trend of correcting errors, but
> maybe it came later.  But my point is that the early competition in
> browsers was NOT in correctness, but in tolerance.  And what resulted
> was a "semantics" which is simply not described by the spec.  And that 
> means "implementing it" does not mean "implementing the spec".

I agree with everything you've written, except with the phrase "design
mistake".  That suggests that someone made a clear error, and if we could
do it all over again anyone would obviously make the other choice.

I don't think that's at all clear.  When the web was young, wide adoption was
the issue.  How important was this new technology going to be in the world?
And the initial limiting factor was content.  There just wasn't that much out
there.  Like many technical innovations, it's easy to construct a
hypothetical that _if_ the whole world adopted your approach, then great
value would result.  Yet it's always tough to get from here to there.

Netscape and IE being so tolerant of errors makes it tough for browser
programmers.  But there are only a few of them, and they're smart, and they
can afford to put some extra effort into the problem.

Meanwhile, where does the content come from?  Why do grandmothers make web
pages with their favorite recipes on them?  How many harried parents of
newborns managed to slap together a web page with photos of their babies?

Things are a little different now, as there are lots of publishing tools
available.  But in the early days, the critical factor was getting ordinary
folks to successfully create content.  Making the browsers more strict would
have resulted in them being far less useful, and it would have discouraged
the creation of additional content, possibly endangering the growth and
importance of the overall web.

After all, why did the web succeed anyway?  It's not like hypertext was a new
idea.  Why did Xanadu, decades earlier and with the "better" idea of
bidirectional links instead of unidirectional links, fail, while the web
succeeded?

I think when you try to analyze questions like that, you'll see that early
browsers being tolerant of HTML errors is right in line with a large number
of decisions made for the web that were all designed for ease of adoption.

I don't think the world would have been better if early browsers rejected
web pages which violated the HTML spec.

        -- Don
_______________________________________________________________________________
Don Geddis                  http://don.geddis.org/               ···@geddis.org
Because the innovator has for enemies all those who have done well under the
old conditions, and lukewarm defenders in those who may do well under the
new.  -- Niccolo Machiavelli, _The Prince_, Chapter VI
From: Kent M Pitman
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <uejmd3web.fsf@nhplace.com>
Don Geddis <···@geddis.org> writes:

> Kent M Pitman <······@nhplace.com> wrote on 21 Apr 2007 12:0:
> > The problem with HTML as a spec is indeed that it presupposes no
> > typos.  And yet the design mistake set early on was to tolerate them.
> > If Mosaic (the original browser) had just said 
> >   "Bad syntax in page.  Can't display it."
> > I think it was Mosaic that started the trend of correcting errors, but
> > maybe it came later.  But my point is that the early competition in
> > browsers was NOT in correctness, but in tolerance.  And what resulted
> > was a "semantics" which is simply not described by the spec.  And that 
> > means "implementing it" does not mean "implementing the spec".
> 
> I agree with everything you've written, except with the phrase
> "design mistake".  That suggests that someone made a clear error,
> and if we could do it all over again anyone would obviously make the
> other choice.
> 
> I don't think that's at all clear.  When the web was young, wide
> adoption was the issue.  How important was this new technology going
> to be in the world?  And the initial limiting factor was content.
> There just wasn't that much out there.  Like many technical
> innovations, it's easy to construct a hypothetical that _if_ the
> whole world adopted your approach, then great value would result.
> Yet it's always tough to get from here to there.

Actually, I've said the same thing myself many times, and I'm definitely
not in disagreement with any of this.

The SGML community had tried to sell HTML for years and they hated the
laxness of HTML, and probably hated more that it was what made it
successful.  Forces tried to migrate HTML to XML, even to the point of
trying to pitch XHTML as an option, and even that has met with openly
admitted failure.

So probably you're right, I was overly harsh and perhaps misleading
with the word "mistake".  I didn't mean it to mean "wrong path".  I
meant it to mean "highly articulable event that explains a later
unwanted thing".

> Netscape and IE being so tolerant of errors makes it tough for
> browser programmers.  But there are only a few of them, and they're
> smart, and they can afford to put some extra effort into the
> problem.

(And makes it hard for new entrants in the market... something probably
neither of them minded.)

Ok, since I'm hell-bent on saying there was a mistake, and you've
argued well that I've pointed to the wrong locus, let me try again.

Perhaps I should say the mistake was not in tolerating the mistakes,
but rather in tolerating them without seeing a need to say them aloud
and standardize the handling... The desire to pretend it wasn't
happening and was not worthy of comment, rather than noting that there
should be a real, shared spec among all browser-makers for how to do
this.

It reminds me of the fight that used to go on between Scheme designers
about handling errors and portability problems.  Many didn't want to
lengthen the Scheme spec because the brevity of it was its virtue.
Saying what happens in weird cases became less important than
pretending those things were unimportant.  And that led to a certain
kind of a world.

CL differs in many technical details, but on this occasion we'll
ignore those.  I'll allege for discussion here, just to make a point,
that the key difference between CL and Scheme is not the decisions it
made, but what its goal was in making those decisions.  Scheme was
about satisfying teachers' need to oversimplify the world for the sake
of making some intellectual points, I'll say (oversimplifying the
point for the sake of making some intellectual points of my own).  CL
was about making CL practical in real world situations.  In CLTL, we
papered over some messy places just to get agreement at all.  By the
time ANSI CL came along, it was clear that continuing to pretend
porting was not an issue would doom the language, so we hit it head
on.  See the remarks about aesthetics in
http://www.nhplace.com/kent/CL/x3j13-86-020.html Note that this did
not mean we were a band of people too stupid to think aesthetics were
important, but rather that we were practical people who knew that
caring only about aesthetics was going to doom the language.

In fairness, too, I'll say the omission was probably not just laziness
but tactics.  I think the browser makers consciously waged a war to
tolerate things others didn't, in hopes that people would describe
Browser A as broken for failing to handle Browser B's favorite
idiosyncrasy.  Sometimes it was a fix, sometimes it was an extra
element, but the apparent idea was to subtly disempower the browser
companies that couldn't keep up.  It is out of this kind of gaming that
the need for standards ultimately arises.

At the present time, we regard such competition among search engines
to be good.  Some correct typos of one kind or another, or allow
interesting search syntax.  As search engines evolve, and people build
things that start to use them in layered ways, I suspect we'll start
to spec out what is needed of them, so that it isn't left to chance.
You can't get to the next layer without the underlayer being firm.
From: Don Geddis
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <87slar4utb.fsf@geddis.org>
Kent M Pitman <······@nhplace.com> wrote on 21 Apr 2007 23:4:
> Actually, I've said the same thing myself many times, and I'm definitely
> not in disagreement with any of this.

Oh.  Well.  If you're going to be all polite about it, that's going to make
it tough for me to maintain my outrage.

> Perhaps I should say the mistake was not in tolerating the mistakes,
> but rather in tolerating them without seeing a need to say them aloud
> and standardize the handling...

Now that's a good idea I hadn't thought of.  Why not just incorporate the
auto-bug-fixing into the standard itself?  Brilliant!

> It reminds me of the fight that used to go on between Scheme designers
> about handling errors and portability problems.  Many didn't want to
> lengthen the Scheme spec because the brevity of it was its virtue.
> Saying what happens in weird cases became less important than
> pretending those things were unimportant.

A good analogy.  Yes, ignoring a problem doesn't mean it will go away.

        -- Don
_______________________________________________________________________________
Don Geddis                  http://don.geddis.org/               ···@geddis.org
  And Jesus said unto them, "And whom do you say that I am?"
  They replied, "You are the eschatological manifestation of the ground of
our being, the ontological foundation of the context of our very selfhood
revealed."
  And Jesus replied, "What?"
From: Frank Buss
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <12l9o0jccr703.18unvq6tjskoo$.dlg@40tude.net>
Don Geddis wrote:

> Things are a little different now, as there are lots of publishing tools
> available.  But in the early days, the critical factor was getting ordinary
> folks to successfully create content.  Making the browsers more strict would
> have resulted in them being far less useful, and it would have discouraged
> the creation of additional content, possibly endangering the growth and
> importance of the overall web.

Some browsers, like Netscape, had integrated editors. I don't think that a
grandmother starts Emacs and writes HTML tags. Providing good editors from
the beginning would have been better for the popularity of the web than
bug-tolerant browsers.

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: dpapathanasiou
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1177249433.095958.266690@y5g2000hsa.googlegroups.com>
> Netscape and IE being so tolerant of errors makes it tough for browser
> programmers.  But there are only a few of them, and they're smart, and they
> can afford to put some extra effort into the problem.

Whenever I have this discussion about the laxity of Netscape and IE
wrt the original html spec, most of my friends reply as Pascal did.

And unfortunately while we cannot go back in time and test the
hypothesis, I think Don's point, above, frames that argument best.
From: Frank Buss
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1i93mb8zxh3q8$.pxab8vufboku$.dlg@40tude.net>
Kent M Pitman wrote:

> I don't see how the strangeness of HTML handling can be improved by a
> numeric slider for bogosity. 

No problem: count the different types of problems in a collection of pages,
e.g. out of 256 pages there are 23 pages with missing closing tags, 45 pages
with interleaved tags, etc. Now you can mix the features to achieve the
desired percentage of compatibility: if you want 40% compatibility with IE,
choose the feature set which results in accepting 40% of the pages :-)

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Kent M Pitman
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <uslatvd9v.fsf@nhplace.com>
Frank Buss <··@frank-buss.de> writes:

> Kent M Pitman wrote:
> 
> > I don't see how the strangeness of HTML handling can be improved by a
> > numeric slider for bogosity. 
> 
> No problem: count the different types of problems of a collection of pages,
> e.g. from 256 pages there are 23 pages with missing closing tags, 45 pages
> with interleaved tags etc. Now you can mix all features to archive the
> desired percentage of compatibility: If you want 40% compatibility with IE,
> choose the features set, which results in accepting 40% of the pages :-)

Right.  As long as you understand that the choice of set members matters,
and that not all 40% solutions are in an equivalence class of behavior, 
you've got my point.

And I'll leave aside the non-canonicality of the thing you're taking
percent of.  (Percent of pages? Percent of covered features? Percent
of error instances?)  Sort of like the US House and US Senate both are
proportionally distributed in their effect... one counts people, one
counts states.  Both equal in some way, yet very different in effect.

Certainly a browser that interpreted "40%" as "I have all the possible
fixing code, but I'll apply it to only 4 out of 10 pages" or "to only 4
out of 10 observed typos" would have funny effects. ;)
From: David Lichteblau
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <slrnf2krr7.3lc.usenet-2006@babayaga.math.fu-berlin.de>
On 2007-04-21, Kent M Pitman <······@nhplace.com> wrote:
> [*] The web browser I refer to is unrelated to the web server I wrote
>  for my own consulting company.  The browser was something I wrote 
>  at Harlequin, before it folded.  Like many interesting things Harlequin
>  did internally, never productized.  I don't know what ended up happening
>  to that code.  It wasn't product-worthy, but it was functional at a 
>  prototype level, capable of displaying early web pages with fonting,
>  tables, gifs, etc.  There as no scripting back then and the HTML spec
>  was smaller.  (But a lot of the growth in the HTML spec has been to 
>  clarify questions everyone had to make anyway and to give users access
>  to them... the basic concepts didn't really change.)

How would your browser compare to Gilbert Baumann's Closure as it is
implemented today?

My own understanding of Closure's internals is very limited, but I am
told it supports nearly all of HTML (perhaps with the exception of
forms, at least in the CLIM backend) and CSS 1 (but not CSS 2).

[...]
>  But it wasn't a tasked project, and so it didn't have anyone saying
>  it had to become finished.  What it became was more of a vehicle for
>  debugging myriad CLIM bugs...  

Did you implement CLIM drawing the way you could have used any other GUI
toolkit as a backend, or did CLIM influence the rendering architecture
in any way?  For example, was output recording helpful?


David
From: Kent M Pitman
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <uirbpv2yd.fsf@nhplace.com>
David Lichteblau <···········@lichteblau.com> writes:

> How would your browser compare to Gilbert Baumann's Closure as it is
> implemented today?
>
> My own understanding of Closure's internals is very limited, but I am
> told it supports nearly all of HTML (perhaps with the exception of
> forms, at least in the CLIM backend) and CSS 1 (but not CSS 2).

I don't know of Closure so I can't compare.  As for CSS, I'm trying to
remember if it existed then.  It might have existed only outboard, as
DSSSL or some such, at the time--influencing concepts but not really
integrated.  I didn't worry about any of that, in the sense of its
being a control structure that I'd have to confront.  At the level of
DOM data structures, which there was not yet a formal way to talk
about interfacing to, but which one inevitably confronts in
constructing one of these, I'm not sure it's even possible to build
one of these things WITHOUT it having the same basic units of internal
componentry with pretty much the same questions to answer.  I'd been
working with FrameMaker MIF internals (a kind of super-verbose vaguely
XML-like interchange format for FrameMaker documents), for example,
and they all seemed the same, just with different words for the terms.
It's like everyone agreed on the display list issues, everyone agreed
on the fonting issues, etc... They just had different UI's and API's
papered atop things.  Over time, they were destined to converge.  I
think XML/CSS just provided a catalyst for doing that.

> >  But it wasn't a tasked project, and so it didn't have anyone saying
> >  it had to become finished.  What it became was more of a vehicle for
> >  debugging myriad CLIM bugs...  
> 
> Did you implement CLIM drawing the way you could have used any other GUI
> toolkit as a backend, or did CLIM influence the rendering architecture
> in any way?  For example, was output recording helpful?

I don't think it was used for anything other than that its calling
sequence was more natural.  e.g., if you had an anchor, you could just
do output inside a macro that knew to make the contents
mouse-sensitive.  As opposed to explicitly handling callbacks.  (Not
that such things couldn't have been abstracted once you know what
you're doing.)  So yes, I think it used output recording to manage
replay.  But the hard work (the tree reshaping and space allocation)
is done in unfolding the HTML into rectangular form (not that that's
super-hard, but it's where the smarts are).  And once you've done
that, the output recording is helping you only minimally because
you've already got a structure in hand that is most of what you'd need
to use any tool.  I can't recall if I also relied on clipping and
scrolling support from CLIM, but probably; then again, I don't recall
if that was one of the sources of slowness.  This was about 10 years
ago now.

One thing that was a pain was GIF decoding.  I wrote my own and I
recall thinking that the GIF algorithm is almost specifically designed
to be vexing for CL array structures, since it is hard to reallocate
space in the oddly chaotic way that LZW uses them while being
efficient about storage.  In retrospect, I might have been better off
allocating a big array for data to go into and then making an emulated
malloc that could just reserve ("allocate") and unreserve
("deallocate") ranges of the array's memory so that I didn't cons a
lot for that.  Might be I should have just called out to existing
libraries for that, but I was curious how LZW worked and wanted, for
some reason of probably foolish pride or unneeded portability or
something, for it all to be in Lisp.  It's possible there's just a
simple trick I didn't think of that would have made an elegant
result. But I remember failing to find one, and feeling like there was
something particularly peculiar about the problem that I should be
learning something important from.
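
Something like the following sketch, written just now rather than dug out
of the old code, is what I have in mind; POOL, RESERVE, and RESET are names
invented for the example, and it's really an arena (reserve ranges, then
reset the whole thing when the LZW code table is cleared) rather than a
true malloc:

  (defstruct pool
    ;; One big preallocated byte vector plus a high-water mark.
    (store (make-array 65536 :element-type '(unsigned-byte 8)))
    (free 0 :type fixnum))

  (defun reserve (pool nbytes)
    "Reserve NBYTES of storage; return the start index into the pool's store."
    (let ((start (pool-free pool)))
      (when (> (+ start nbytes) (length (pool-store pool)))
        (error "pool exhausted"))
      (setf (pool-free pool) (+ start nbytes))
      start))

  (defun reset (pool)
    "Unreserve everything at once, e.g. when the LZW code table is cleared."
    (setf (pool-free pool) 0))

Each LZW table entry would then be a (start . length) pair into the one
store instead of a freshly consed string or array, so decoding a frame
would cons almost nothing.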

While I'm on the subject of PostScript (from which this browser came) and
LZW, btw, another odd thing that I recall mulling at the time and not
getting time to track down, but that maybe someone here knows:
I found myself thinking about what it would mean to "compile" PostScript.
I wondered if something LZW-like would provide a decent compiler for
PostScript.  That is, since data and program are unrelated in PostScript,
the notion of compilation is not very meaningful except in terms of 
compactification.  In some sense, what makes a good PostScript program
might be said to be something that just chunks it up into long common
substrings, and LZW (or any compression algorithm--that's just the one I
was looking at) seems to be about locating repeated substrings and factoring
them out.  So I wondered whether, if rather than working at the character
level one worked at the token level, there was any kind of fun subroutinization
to be had in compiling PostScript.  Maybe it had already been tried, or
maybe it goes nowhere.  But I figured I'd say it out loud just as a thought
in case it triggered anything in anyone else.
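
A toy version of what I mean, operating in Lisp on a list of PostScript
tokens (strings), treating the problem like byte-pair encoding at the token
level; MOST-COMMON-PAIR, FOLD-PAIR, SUBROUTINIZE, and the generated names
s0, s1, ... are all inventions for this sketch:

  (defun most-common-pair (tokens)
    "Return the most frequent adjacent pair of tokens, or NIL if none repeats."
    (let ((counts (make-hash-table :test #'equal))
          (best nil)
          (best-count 1))
      (loop for (a b) on tokens while b
            do (let ((n (incf (gethash (list a b) counts 0))))
                 (when (> n best-count)
                   (setf best (list a b) best-count n))))
      best))

  (defun fold-pair (tokens pair name)
    "Replace each occurrence of PAIR in TOKENS by the single token NAME."
    (loop while tokens
          if (and (cdr tokens)
                  (equal (list (first tokens) (second tokens)) pair))
            collect name and do (setf tokens (cddr tokens))
          else
            collect (pop tokens)))

  (defun subroutinize (tokens &optional defs (i 0))
    "Repeatedly fold the commonest pair into a named procedure.  Returns the
  compacted token list plus a list of (NAME (TOK1 TOK2)) definitions, which
  would print back out as /s0 { tok1 tok2 } def and so on."
    (let ((pair (most-common-pair tokens)))
      (if (null pair)
          (values tokens (reverse defs))
          (let ((name (format nil "s~D" i)))
            (subroutinize (fold-pair tokens pair name)
                          (cons (list name pair) defs)
                          (1+ i))))))

Whether the definitions plus the compacted program ever come out smaller
than the original on real PostScript is exactly the part I never got around
to measuring.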
From: Frank Buss
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <1fo2cz87yutfu$.1k9onzv8hono1$.dlg@40tude.net>
Kent M Pitman wrote:

> I found myself thinking about what it would mean to "compile" PostScript.
> I wondered if something LZW-like would provide a decent compiler for
> PostScript.  That is, since data and program are unrelated in PostScript,
> the notion of compilation is not very meaningful except in terms of 
> compactification.  

I think "compiling" for PostScript means the same like compiling other
languages, e.g. calculate anything you can do (e.g. substituting "1 2 add"
by "3"), to reduce rendering time:

http://www.tinaja.com/glib/guru68.pdf
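
For instance, here is a toy peephole pass in Lisp over a token list (numbers
and operator strings); FOLD-CONSTANTS is just a name for this sketch, and it
only fires when the two preceding output tokens are literal numbers, so it
handles exactly the "1 2 add" kind of pattern and nothing cleverer:

  (defun fold-constants (tokens)
    "Fold constant arithmetic, e.g. the token list (1 2 \"add\") into (3)."
    (let ((ops '(("add" . +) ("sub" . -) ("mul" . *)))
          (out '()))
      (dolist (tok tokens (nreverse out))
        (let ((op (and (stringp tok) (cdr (assoc tok ops :test #'string=)))))
          (if (and op (numberp (first out)) (numberp (second out)))
              (let ((b (pop out)) (a (pop out)))
                ;; Operands come off in reverse order: "5 2 sub" is 5 - 2.
                (push (funcall op a b) out))
              (push tok out))))))

  ;; (fold-constants '(1 2 "add" 3 "mul"))  =>  (9)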

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Gisle Sælensminde
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <0n8xc99802.fsf@apal.ii.uib.no>
Kent M Pitman <······@nhplace.com> writes:

> 
> Modularizing the task into something that corrects bad HTML to good 
> and something that displays good HTML is probably the way to go.
> Parsers for bad HTML don't have to know about HTML "meaning", just its
> structure.  

This is in fact what at least one web browser does internally. In an earlier
job I did web browser development, and that browser first parsed the HTML
into a DOM tree; before the tree was sent to the rendering engine, it went
through a so-called "DOM-fixer". The DOM-fixer was basically a set of rules for
rewriting bad HTML, so that the rendering engine did not have to deal with it.
These rules were constantly rewritten so that we could display all the
pages the other guys could display. This code was required for the browser
to be able to show what people expected a browser to display.
I would guess that most web browsers do something similar.
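
To give the flavour of it, here is a fresh Lisp sketch, not code from that
browser; the node representation (TAG ATTRS CHILD...) and the names
APPLY-RULES and *DOM-FIXERS* are made up for the example.  The one rule
shown splices a <p> nested directly inside another <p> into its parent:

  (defun node-p (x) (and (consp x) (keywordp (first x))))
  (defun node-tag (n) (first n))
  (defun node-children (n) (cddr n))

  (defun apply-rules (node rules)
    "Rewrite NODE bottom-up; at each node the first rule whose test matches fires."
    (if (not (node-p node))
        node
        (let* ((kids (mapcar (lambda (c) (apply-rules c rules))
                             (node-children node)))
               (node (list* (node-tag node) (second node) kids)))
          (dolist (rule rules node)
            (when (funcall (car rule) node)
              (return (funcall (cdr rule) node)))))))

  (defparameter *dom-fixers*
    (list
     ;; A <p> directly inside a <p> is illegal, so splice the inner one's
     ;; children into the outer one.
     (cons (lambda (n)
             (and (eq (node-tag n) :p)
                  (some (lambda (c) (and (node-p c) (eq (node-tag c) :p)))
                        (node-children n))))
           (lambda (n)
             (list* :p (second n)
                    (mapcan (lambda (c)
                              (if (and (node-p c) (eq (node-tag c) :p))
                                  (copy-list (node-children c))
                                  (list c)))
                            (node-children n)))))))

  ;; (apply-rules '(:p nil "a" (:p nil "b") "c") *dom-fixers*)
  ;;   =>  (:P NIL "a" "b" "c")

The real thing had many more rules than this and presumably iterated until
nothing changed; this sketch applies at most one rule per node.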


-- 
Gisle Sælensminde, Phd student, Scientific programmer
Computational biology unit, BCCS, University of Bergen, Norway, 
Email: ·····@cbu.uib.no
The best way to travel is by means of imagination
From: Kent M Pitman
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <u4pmxls38.fsf@nhplace.com>
·····@apal.ii.uib.no (Gisle Sælensminde) writes:

> Kent M Pitman <······@nhplace.com> writes:
> 
> > 
> > Modularizing the task into something that corrects bad HTML to good 
> > and something that displays good HTML is probably the way to go.
> > Parsers for bad HTML don't have to know about HTML "meaning", just its
> > structure.  
> 
> This is in fact what at least one web browser does internally. In an earlier
> job I did web browser development, and that browser first parsed the HTML
> into a DOM tree; before the tree was sent to the rendering engine, it went
> through a so-called "DOM-fixer". The DOM-fixer was basically a set of rules for
> rewriting bad HTML, so that the rendering engine did not have to deal with it.
> These rules were constantly rewritten so that we could display all the
> pages the other guys could display. This code was required for the browser
> to be able to show what people expected a browser to display.
> I would guess that most web browsers do something similar.

Thanks much for the data point.

Do you recall how they handled the special case of <b><a>...</b>...</a>?
Did they have special knowledge of <a>...</a> and its spanning
abilities, or was there a general rule?  It was the only one I could think
of where you'd need to know something special about the semantics to resolve.

One might argue that old-style <p> should have required some special knowledge,
too, but the question seems to have been resolved in favor of not really fixing
the interpretation people meant (that is, not trying to find the other end of
the <p> but rather just treating some parts as "not in any <p>" and others as
"in ones they didn't expect to be in").
From: Robert Maas, see http://tinyurl.com/uh3t
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <rem-2007apr26-008@yahoo.com>
> From: "Thomas F. Burdick" <········@gmail.com>
> Face it, HTML is a markup language historically created directly
> by humans, which means you *will* get good content with syntax
> errors by authors who will not fix it.

I'm not talking about occasionally crappy HTML in personal Web
pages. I'm talking about bugs in software that generate the same
crappy HTML millions of times per day: every time anyone anywhere
in the world asks Google to perform a search, the same crappy
mistake appears in *every* copy of the form emitted by Google's search
engine. The same goes for the toplevel forms that invoke Google's search
engines, which are fetched via bookmarks or links millions of times per day.
A teensy bit of effort to fix those forms and form-generating
software would fix many millions of Web pages delivered per day.
From: John Thingstad
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <op.tq2z9xa2pqzri1@pandora.upc.no>
On Fri, 20 Apr 2007 14:05:02 +0200, John Thingstad  
<··············@chello.no> wrote:

>> My guess is some smartypants at Google thought it'd make good P.R.
>> to declare the document as XHTML instead of HTML, without realizing
>> that the document wasn't valid XHTML at all, and the DTD used was
>> totally inappropriate for this document. Does anybody know, from
>> eyeballing the entire WebPage source, which DOCTYPE/DTD
>> declaration would be appropriate to make it almost pass
>> validation? I bet, with the correct DOCTYPE declaration, there'd
>> be only fifty or a hundred validation errors, mostly the kind I
>> mentioned earlier which I discovered when testing my new parser.

Oh, I should mention: try the HTML 4.0 Transitional DOCTYPE.
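
That is, presumably something along the lines of the 4.01 Transitional
declaration (I'm quoting it from memory, so check it against the spec):

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
      "http://www.w3.org/TR/html4/loose.dtd">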


-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: John Thingstad
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <op.tq4zeok2pqzri1@pandora.upc.no>
On Fri, 20 Apr 2007 14:05:02 +0200, John Thingstad  
<··············@chello.no> wrote:

> On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see  
> http://tinyurl.com/uh3t <·······@yahoo.com> wrote:
>
> As an ex-employee of Opera I can say that writing a Web Browser is hard!
> It is not so much the parsing of correct HTML as the parsing of incorrect
> HTML that poses the problem. Let's face it, it could be simple
> if we all used XHTML and the browser aborted with an error message
> when an error occurred. Unfortunately that is hardly the case.
> SGML is more difficult to parse. Then there is the fact that many
> sites rely on errors in the HTML being handled just like in
> Microsoft Explorer. I can't count the number of times I heard that Opera
> was broken, only to find that it was an HTML error on the web site that
> Explorer got around.
>

I am a bit reluctant to reply to this one.
Suffice it to say I was warning him about the difficulties.
I don't know or care about the difficulties of creating a web browser.
Of course that is not exactly true, hence the reluctance.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: John Thingstad
Subject: Re: Writing HTML parser wasn't as hard as I thought it'd be
Date: 
Message-ID: <op.tq4zmkrdpqzri1@pandora.upc.no>
On Sat, 21 Apr 2007 15:50:38 +0200, John Thingstad  
<··············@chello.no> wrote:

> On Fri, 20 Apr 2007 14:05:02 +0200, John Thingstad  
> <··············@chello.no> wrote:
>
>> On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see  
>> http://tinyurl.com/uh3t <·······@yahoo.com> wrote:
>>
>> As an ex-employee of Opera I can say that writing a Web Browser is hard!
>> It is not so much the parsing of correct HTML as the parsing of incorrect
>> HTML that poses the problem. Let's face it, it could be simple
>> if we all used XHTML and the browser aborted with an error message
>> when an error occurred. Unfortunately that is hardly the case.
>> SGML is more difficult to parse. Then there is the fact that many
>> sites rely on errors in the HTML being handled just like in
>> Microsoft Explorer. I can't count the number of times I heard that Opera
>> was broken, only to find that it was an HTML error on the web site that
>> Explorer got around.
>>
>
> I am a bit reluctant to reply to this one.
> Suffice it to say I was warning him about the difficulties.
> I don't know or care about the difficulties of creating a web browser.
> Of course that is not exactly true, hence the reluctance.
>

Sorry about the word pun in line two.

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/