parsing html.

From: John Vert
Subject: parsing html.
Date: Fri, 31 Aug 2001 03:56:16 +0000
Message-ID: <b4b5ad88.0108301956.27c814ff@posting.google.com>

hi,

  i'm trying to come up with a simple way to turn html in lisp-like
notation like the one below,

(setf myhtml '(html (head (title "Hello World"))
                    (body (h1 "Hello World"))))

into regular html.  i started doing this with the following code: 


(defun lisp2html (src)
  (cond ((null src) nil)
        ((atom (car src))
         (progn
           (if (stringp (car src))
               (format t "~A~%" (car src))
               (format t "<~A>~%" (car src)))
           (parser (cdr src))))
        ((listp (car src)) (parser (car src)))
        (t (parser (cdr src)))))

the problem is that i cannot think of a way to build a recursive
function that would allow me to print the closing tags for each tag
met; i.e. i can print all the opening tags but have no way to, once i
know the tag has ended, to retrieve the tag's name in order to close
it.  i'm hoping someone could guide me in the right way to go about
this, such as basic pseudo code i can follow since i have no idea how
to go about this.  i am not looking for a bullet-proof, feature full
xml parser, but just for the general gist.

thanks, 
 --john

Re: parsing html. Steven M. Haflich
Re: parsing html. Friedrich Dominicus
- Re: parsing html. Chris Riesbeck
  - Re: parsing html. Christophe Rhodes
Re: parsing html. Tim Bradshaw

From: Steven M. Haflich
Subject: Re: parsing html.
Date: Fri, 31 Aug 2001 05:10:37 +0000
Message-ID: <3B8F1C4D.5538FBA9@pacbell.net>

John Vert wrote:

>   i'm trying to come up with a simple way to turn html in lisp-like
> notation like the one below,
> 
> (setf myhtml '(html (head (title "Hello World")) (body (h1 "Hello
> World"))))
> 
> into regular html.  i started doing this with the following code:
> 
> (defun lisp2html (src)
>   (cond ((null src) nil)
>         ((atom (car src))
>          (progn
>            (if (stringp (car src))
>                (format t "~A~%" (car src))
>                (format t "<~A>~%" (car src)))
>            (parser (cdr src))))
>         ((listp (car src)) (parser (car src)))
>         (t (parser (cdr src)))))
> 
> the problem is that i cannot think of a way to build a recursive
> function that would allow me to print the closing tags for each tag
> met; i.e. i can print all the opening tags but have no way to, once i
> know the tag has ended, to retrieve the tag's name in order to close
> it.  i'm hoping someone could guide me in the right way to go about
> this, such as basic pseudo code i can follow since i have no idea how
> to go about this.  i am not looking for a bullet-proof, feature full
> xml parser, but just for the general gist.

You are very confused about one thing.  In usual terminology what
you are trying to write is _not_ a parser.  It is a "generator" or
"printer" or "emitter".

But your realization is correct that recursion is necessary to keep
track of element nesting.  Below is an elegant solution that uses
the CL pretty printer, actually part of a much larger XML printer
that I've written.  Like your attempt above, this simplified version
does not provide for element attributes, nor does it provide the
necessary escapes for #\< and #\& characters that occur in string
content.

The great wits on this newsgroup will no doubt point out, if I don't
do it first, that it is a complete waste to pretty print HTML or XML.
It takes extra time and makes the output larger, while changing the
semantics of the generated output not one bit.  The extra whitespace
emitted by the pretty printer can be turned off, of course, simply by
binding *print-pretty* to nil.  But during program development, I've found
that emitted html (or xml) without pretty printing can be hellishly
difficult to read and debug.  Pretty printing it greatly facilitates
the human readability of what your program generates.  Your mileage may
vary, but see the last example below.

Here's the code and some simple examples of its operation:

<51> (defun pprint-element (element stream)
       (labels ((pprint-element-1 (element stream)
		  (pprint-logical-block (stream nil)
		    (format stream "<~a>" (car element))
		    (pprint-indent :block 2 stream)
		    (loop for content in (cdr element)
			do (pprint-newline :linear stream)
			   (typecase content
			     ;; To do: rewrite with xml entities for #\< and #\&.
			     (string (write-string content stream))
			     (cons (pprint-element-1 content stream))
			     (t (warn "Bogus element content: ~s" content))))
		    (pprint-indent :block 0 stream)
		    (pprint-newline :linear stream)
		    (format stream "</~a>" (car element)))))
	 (pprint-element-1 element stream)
	 (terpri stream)
	 (values)))
pprint-element
<52> (let ((*print-right-margin* 80)
	   (*print-pretty* t))
       (pprint-element '(html (head (title "Hello World"))
			 (body (h1 "Hello World")
			  (p "The " (b "check") " is in the mail.")))
		       *standard-output*))
<html>
  <head><title>Hello World</title></head>
  <body><h1>Hello World</h1><p>The <b>check</b> is in the mail.</p></body>
</html>
<53> (let ((*print-right-margin* 50)
	   (*print-pretty* t))
       (pprint-element '(html (head (title "Hello World"))
			 (body (h1 "Hello World")
			  (p "The " (b "check") " is in the mail.")))
		       *standard-output*))
<html>
  <head><title>Hello World</title></head>
  <body>
    <h1>Hello World</h1>
    <p>The <b>check</b> is in the mail.</p>
  </body>
</html>
<54> (let ((*print-right-margin* 50)
	   (*print-pretty* nil))
       (pprint-element '(html (head (title "Hello World"))
			 (body (h1 "Hello World")
			  (p "The " (b "check") " is in the mail.")))
		       *standard-output*))
<html><head><title>Hello World</title></head><body><h1>Hello World</h1><p>The <b>check</b> is in the mail.</p></body></html>
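
The escaping left as a "To do" in the code above could look roughly like the following. This is only a sketch: the helper name escape-text is hypothetical and is not part of the posted printer, and it handles only the two characters mentioned, #\< and #\&.

```lisp
;; Hypothetical helper (not from the original post): escape the two
;; characters that must not appear literally in element content.
(defun escape-text (string)
  "Return a copy of STRING with & and < replaced by &amp; and &lt;."
  (with-output-to-string (out)
    (loop for ch across string
          do (case ch
               (#\& (write-string "&amp;" out))
               (#\< (write-string "&lt;" out))
               (t (write-char ch out))))))

;; Example call:
(escape-text "a < b & c")   ; => "a &lt; b &amp; c"
```

Under that assumption, the string clause of the typecase would become (string (write-string (escape-text content) stream)).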

From: Friedrich Dominicus
Subject: Re: parsing html.
Date: Fri, 31 Aug 2001 05:40:22 +0000
Message-ID: <87u1yo7po9.fsf@frown.here>

········@my-deja.com (John Vert) writes:

> hi,
> 
>   i'm trying to come up with a simple way to turn html in lisp-like
> notation like the one below,
> 
> (setf myhtml '(html (head (title "Hello World")) (body (h1 "Hello
> World"))))
> 
> into regular html.  i started doing this with the following code:

Isn't that re-inventing the wheel? Have you checked out one of
- CL-HTTP
- AllegroServe
- Araneida

?

Regards
Friedrich

From: Chris Riesbeck
Subject: Re: parsing html.
Date: Fri, 31 Aug 2001 19:45:04 +0000
Message-ID: <riesbeck-08EEA7.14450431082001@news.it.nwu.edu>

>········@my-deja.com (John Vert) writes:
>
>> hi,
>> 
>>   i'm trying to come up with a simple way to turn html in lisp-like
>> notation like the one below,
>> 
>> (setf myhtml '(html (head (title "Hello World")) (body (h1 "Hello
>> World"))))
>> 
>> into regular html.  i started doing this with the following code:
>Isn't that re-inventing the wheel? Have you checked out one of

John,

The 3 replies I've seen so far are all correct, but I assume you'd
like to see a simple answer to your simple problem too.

Your input syntax appears to be (tag item ...) where each item might
be either another (tag ...) form or an atom, typically a string, but
we can allow numbers and symbols at no extra cost.

That's a nice simple syntax, and so the recursion is nice and
simple too:

  if it's null, quit
  if it's an atom, print it
  else print <car>,
       recursively do each item
       print </car>

In Lisp, 

(defun lisp2html (src)
  (cond ((null src) nil)
        ((atom src) (format t "~A~%" src))
        (t (format t "<~A>~%" (first src))
           (mapc #'lisp2html (rest src))
           (format t "</~A>~%" (first src)))))
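
One thing to be aware of with the function above: with the default reader and printer settings, symbols such as html print in upper case, so the tags come out as <HTML> and </HTML>. A possible variant, sketched here under the hypothetical name emit-html, lower-cases the tag and takes an explicit stream so the result can be captured in a string:

```lisp
;; Hypothetical variant of lisp2html (the name emit-html is made up here).
;; It writes to STREAM and lower-cases tag names before printing them.
(defun emit-html (src &optional (stream *standard-output*))
  (cond ((null src) nil)
        ;; Atoms (strings, numbers, symbols) are printed as content.
        ((atom src) (format stream "~A~%" src))
        ;; A list is (tag item ...): open tag, items, close tag.
        (t (let ((tag (string-downcase (first src))))
             (format stream "<~A>~%" tag)
             (dolist (item (rest src))
               (emit-html item stream))
             (format stream "</~A>~%" tag)))))

;; Capture the output in a string instead of printing it:
(with-output-to-string (s)
  (emit-html '(h1 "Hello World") s))
```

The recursion is the same as before; only the destination and the case of the tag names differ.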

From: Christophe Rhodes
Subject: Re: parsing html.
Date: Fri, 31 Aug 2001 20:42:25 +0000
Message-ID: <sqofowdkr2.fsf@athena.jesus.cam.ac.uk>

Chris Riesbeck <········@cs.northwestern.edu> writes:

> >········@my-deja.com (John Vert) writes:
> >
> >> hi,
> >> 
> >>   i'm trying to come up with a simple way to turn html in lisp-like
> >> notation like the one below,
> >> 
> >> (setf myhtml '(html (head (title "Hello World")) (body (h1 "Hello
> >> World"))))
> >> 
> >> into regular html.  i started doing this with the following code:
> >Isn't that re-inventing the wheel? Have you checked out one of
> 
> John,
> 
> The 3 replies I've seen so far are all correct, but I assume you'd
> like to see a simple answer to your simple problem too.

And if you want a stupidly complex, non-portable answer, go to
http://groups.google.com and enter the search terms 

"WARNING lisp kernel fdefn"

I trust that this is enough to warn people with a faint heart of
danger :)

Christophe

From: Tim Bradshaw
Subject: Re: parsing html.
Date: Fri, 31 Aug 2001 10:08:55 +0000
Message-ID: <nkj66b4ee2w.fsf@omega.tardis.ed.ac.uk>

········@my-deja.com (John Vert) writes:

> hi,
> 
>   i'm trying to come up with a simple way to turn html in lisp-like
> notation like the one below,
> 
> (setf myhtml '(html (head (title "Hello World")) (body (h1 "Hello
> World"))))
> 

There are several systems out there which do something like this.  You
might want to look at my htout.lisp which lives at
http://www.tfeb.org/lisp/hax.html.  It's not particularly big or fancy
- other systems do a lot more, though I don't have pointers to any to
hand.  Unlike your function it's a macro, which takes some `code'
which is really a lispy representation of HTML and expands it into
code which will print the HTML.  You could use the utility code there
to make a function like yours, I think. You can intermingle Lisp and
HTML freely which is nice.  It doesn't pretty print (though like Steve
Haflich, I think it is a good thing to do so).

About the only clever thing htout does is to maintain a constantness
bit at macro expansion time.  If the HTML is constant (no embedded
lisp code, basically) it then generates the final string at
macroexpansion time, and the expansion ends up looking like
(WRITE-SEQUENCE ...).  I think this is a neat demonstration of the
kind of things that Lisp macros make possible.
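
To illustrate the idea only (this is not the htout code, and every name below is hypothetical), here is a small sketch of a macro that checks at expansion time whether its HTML form is constant. It assumes a made-up convention in which embedded Lisp is written as (:lisp expr) and everything else is either a string or a (tag item ...) list, and it assumes the stream argument is a plain variable.

```lisp
;; Hypothetical sketch, NOT the htout implementation.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun constant-html-p (form)
    "True if FORM contains no (:lisp ...) items anywhere."
    (cond ((atom form) (stringp form))
          ((eq (first form) :lisp) nil)
          (t (every #'constant-html-p (rest form)))))

  (defun html-string (form)
    "Render a constant FORM to a string."
    (if (stringp form)
        form
        (let ((tag (string-downcase (first form))))
          (format nil "<~A>~{~A~}</~A>"
                  tag (mapcar #'html-string (rest form)) tag))))

  (defun expand-html (form stream)
    "Return code that prints FORM to STREAM."
    (cond ((constant-html-p form)
           ;; Constant subtree: the whole string is built right now.
           `(write-sequence ,(html-string form) ,stream))
          ((atom form) (error "Unsupported item: ~S" form))
          ((eq (first form) :lisp)
           ;; Embedded Lisp: evaluated and printed at run time.
           `(princ ,(second form) ,stream))
          (t (let ((tag (string-downcase (first form))))
               `(progn
                  (write-sequence ,(format nil "<~A>" tag) ,stream)
                  ,@(mapcar (lambda (item) (expand-html item stream))
                            (rest form))
                  (write-sequence ,(format nil "</~A>" tag) ,stream)))))))

(defmacro with-html ((stream) form)
  (expand-html form stream))

;; A constant form collapses to a single call:
;; (macroexpand-1 '(with-html (s) (p "Hi " (b "there"))))
;; => (WRITE-SEQUENCE "<p>Hi <b>there</b></p>" S)
```

A form such as (with-html (s) (p "Hello, " (:lisp name))) is not constant, so it expands into a progn that writes the constant pieces and calls princ on name at run time.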

You should look at some of the more complete systems too.

--tim