From: Kyongho Min
Subject: HTML to Text
Date: 
Message-ID: <3B7768FA.3A96E087@aut.ac.nz>
Dear Lispers,

I wonder if anyone knows of open-source Lisp code to convert an HTML file
into a plain ASCII file.
Please recommend any simple code.

Thanks,

Kyongho Min

From: Raymond Wiker
Subject: Re: HTML to Text
Date: 
Message-ID: <86k808l23q.fsf@raw.grenland.fast.no>
Kyongho Min <···········@aut.ac.nz> writes:

> Dear Lispers,
> 
> I wonder if anyone knows of open-source Lisp code to convert an HTML file
> into a plain ASCII file.
> Please recommend any simple code.

        If all you want is to strip out the HTML markup, you might try
the following:

        Note: strip-xml-markup removes *all* markup, leaving only the
text. Also, strip-xml-markup reads from a string and returns the
stripped string; it should be trivial to rewrite it to operate on
streams. 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

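;; One entry per state.  Each entry lists (char . next-state) pairs,
;; followed by the default next state for any other character.
;; States 1-3 also accept > directly, so that <>, <!> and <!-> end
;; the markup instead of swallowing text up to the next >.
(defparameter *strip-xml-parse-table*
  #(((#\< . 1) 0)                       ; 0 - normal state
    ((#\! . 2) (#\> . 0) 5)             ; 1 - after <
    ((#\- . 3) (#\> . 0) 5)             ; 2 - after <!
    ((#\- . 4) (#\> . 0) 5)             ; 3 - after <!-
    ((#\- . 8) 4)                       ; 4 - comment <!--
    ((#\' . 6)                          ; 5 - markup
     (#\" . 7)
     (#\> . 0)
     5)
    ((#\' . 5) 6)                       ; 6 - markup, single-quote
    ((#\" . 5) 7)                       ; 7 - markup, double-quote
    ((#\- . 9) 4)                       ; 8 - comment, after -
    ((#\> . 0) (#\- . 9) 4)             ; 9 - comment, after --(*)
    ))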

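(defun lookup-transition (state input)
  "Return the state that follows STATE when the character INPUT is read."
  (let ((transitions (svref *strip-xml-parse-table* state)))
    (dolist (elt transitions)
      (cond ((atom elt)                 ; default entry, matches anything
             (return-from lookup-transition elt))
            ((char-equal input (car elt))
             (return-from lookup-transition (cdr elt)))))
    ;; Reached only if a table entry has no default state.
    (error "State table error")))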
	      
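(defun strip-xml-markup (string)
  "Return STRING with each tag or comment replaced by a single space."
  (let ((state 0))
    (with-output-to-string (out)
      (with-input-from-string (in string)
        (do ((char (read-char in nil) (read-char in nil)))
            ((null char))
          (let ((new-state (lookup-transition state char)))
            ;; Write only on arrival in state 0: the character itself
            ;; if we were already in text, a space if markup just ended.
            (when (zerop new-state)
              (write-char
               (if (zerop state)
                   char
                   #\Space)
               out))
            (setf state new-state)))))))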
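
        The note above says a stream version should be trivial. A
minimal sketch of what that might look like follows. The names
strip-xml-markup-stream and strip-xml-markup-from-string, the argument
order, and the file name page.html are assumptions made for
illustration and are not part of the original post. The code relies on
lookup-transition and *strip-xml-parse-table* as defined above.

```lisp
;; Hypothetical stream-based variant (name and argument order are
;; assumptions).  Uses LOOKUP-TRANSITION from the code above.
(defun strip-xml-markup-stream (in out)
  "Copy characters from stream IN to stream OUT, writing a single
space in place of each tag or comment.  Returns OUT."
  (do ((state 0)
       (char (read-char in nil) (read-char in nil)))
      ((null char) out)
    (let ((new-state (lookup-transition state char)))
      ;; Same output rule as the string version: write only on
      ;; arrival in state 0.
      (when (zerop new-state)
        (write-char (if (zerop state) char #\Space) out))
      (setf state new-state))))

;; With that in place, a string version is a thin wrapper
;; (hypothetical name, to avoid clobbering the original function).
(defun strip-xml-markup-from-string (string)
  (with-output-to-string (out)
    (with-input-from-string (in string)
      (strip-xml-markup-stream in out))))

;; Possible use on a file (page.html is a made-up name):
;;   (with-open-file (in "page.html")
;;     (strip-xml-markup-stream in *standard-output*))
```

        Passing both streams explicitly leaves it to the caller to
decide where the text comes from and where it goes, so the same
function should serve for files, sockets and strings alike.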

-- 
Raymond Wiker
·············@fast.no
From: Kyongho Min
Subject: Re: HTML to Text
Date: 
Message-ID: <3B7AF7BA.EFE6E7D7@aut.ac.nz>
Many thanks for your help.

From: Johannes Beck
Subject: Re: HTML to Text
Date: 
Message-ID: <3B7C1AE5.3B0BF390@arcormail.de>
> > > I wonder if anyone knows of open-source Lisp code to convert an HTML
> > > file into a plain ASCII file.
> > > Please recommend any simple code.
> >
> >         If all you want is to strip out the HTML markup, you might try
> > the following:
> >
> >         Note: strip-xml-markup removes *all* markup, leaving only the
> > text. Also, strip-xml-markup reads from a string and returns the
> > stripped string; it should be trivial to rewrite it to operate on
> > streams.

You've forgotten an iron rule of c.l.l: never post complete solutions to
homework.

Joe
From: Tim Moore
Subject: Re: HTML to Text
Date: 
Message-ID: <9ljmr8$rke$0@216.39.145.192>
On Thu, 16 Aug 2001, Johannes Beck wrote:
> You've forgotten an iron rule of c.l.l: never post complete solutions to
> homework.
> 
> Joe
> 
How else would we amuse ourselves? :)
Tim