From: Kyongho Min
Subject: HTML to Text
Date: Mon, 13 Aug 2001 05:43:22 +0000
Message-ID: <3B7768FA.3A96E087@aut.ac.nz>

Dear Lispers,
I wonder if anyone knows of open-source Lisp code to convert an HTML file
into a plain ASCII file.
Please recommend any simple code.
Thanks,
Kyongho Min

From: Raymond Wiker
Subject: Re: HTML to Text
Date: Mon, 13 Aug 2001 09:51:21 +0000
Message-ID: <86k808l23q.fsf@raw.grenland.fast.no>

Kyongho Min <···········@aut.ac.nz> writes:
> Dear Lispers,
>
> I wonder if anyone knows of open-source Lisp code to convert an HTML file
> into a plain ASCII file.
> Please recommend any simple code.
If all you want is to strip out the HTML markup, you might try
the following:
Note: strip-xml-markup removes *all* markup, leaving only the
text. Also, strip-xml-markup reads from a string and returns the
stripped string; it should be trivial to rewrite it to operate on
streams.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
(defparameter *strip-xml-parse-table*
  #(((#\< . 1) 0)                   ; 0 - normal state
    ((#\! . 2) 5)                   ; 1 - after <
    ((#\- . 3) 5)                   ; 2 - after <!
    ((#\- . 4) 5)                   ; 3 - after <!-
    ((#\- . 8) 4)                   ; 4 - comment <!--
    ((#\' . 6)                      ; 5 - markup
     (#\" . 7)
     (#\> . 0)
     5)
    ((#\' . 5) 6)                   ; 6 - markup, single-quote
    ((#\" . 5) 7)                   ; 7 - markup, double-quote
    ((#\- . 9) 4)                   ; 8 - comment, after -
    ((#\> . 0) (#\- . 9) 4)         ; 9 - comment, after --(*)
    ))

(defun lookup-transition (state input)
  (let ((transitions (svref *strip-xml-parse-table* state)))
    (dolist (elt transitions)
      (cond ((atom elt)
             (return-from lookup-transition elt))
            ((char-equal input (car elt))
             (return-from lookup-transition (cdr elt)))))
    (error "State table error")))

(defun strip-xml-markup (string)
  (let ((state 0))
    (with-output-to-string (out)
      (with-input-from-string (in string)
        (do ((char (read-char in nil) (read-char in nil)))
            ((null char))
          (let ((new-state (lookup-transition state char)))
            (when (zerop new-state)
              (write-char
               (if (zerop state)
                   char
                   #\Space)
               out))
            (setf state new-state)))))))
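
The note above says a stream version should be trivial. What follows is
only a sketch of one way to do that, not part of the posted code. The
names next-state and strip-markup-stream are hypothetical, and next-state
simply restates the transitions of *strip-xml-parse-table* as a function
so that the block loads on its own.

```lisp
;; Hypothetical stream-based variant (assumed names, not from the post).
;; NEXT-STATE mirrors the rows of *strip-xml-parse-table* one for one.
(defun next-state (state char)
  "Return the state reached from STATE after reading CHAR."
  (ecase state
    (0 (if (char= char #\<) 1 0))                 ; normal text
    (1 (if (char= char #\!) 2 5))                 ; after <
    (2 (if (char= char #\-) 3 5))                 ; after <!
    (3 (if (char= char #\-) 4 5))                 ; after <!-
    (4 (if (char= char #\-) 8 4))                 ; inside a comment
    (5 (case char (#\' 6) (#\" 7) (#\> 0) (t 5))) ; inside markup
    (6 (if (char= char #\') 5 6))                 ; markup, single-quote
    (7 (if (char= char #\") 5 7))                 ; markup, double-quote
    (8 (if (char= char #\-) 9 4))                 ; comment, after -
    (9 (case char (#\> 0) (#\- 9) (t 4)))))       ; comment, after --

(defun strip-markup-stream (in out)
  "Copy characters from stream IN to stream OUT, writing one space
in place of each tag or comment."
  (do ((state 0)
       (char (read-char in nil) (read-char in nil)))
      ((null char))
    (let ((new-state (next-state state char)))
      ;; Output happens only on arrival in state 0: ordinary text is
      ;; copied through, and the character closing a tag or comment
      ;; becomes a single space.
      (when (zerop new-state)
        (write-char (if (zerop state) char #\Space) out))
      (setf state new-state))))
```

Wrapping a string in streams gives the same behaviour as the string
version, with one space left behind per tag:

    (with-input-from-string (in "<p>Hello <b>world</b></p>")
      (with-output-to-string (out)
        (strip-markup-stream in out)))
    ;; => " Hello  world  "

For a file, the IN argument could come from with-open-file and OUT could
be *standard-output*.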
--
Raymond Wiker
·············@fast.no

From: Kyongho Min
Subject: Re: HTML to Text
Date: Wed, 15 Aug 2001 22:29:14 +0000
Message-ID: <3B7AF7BA.EFE6E7D7@aut.ac.nz>

Many thanks for your help.
Raymond Wiker wrote:
> Kyongho Min <···········@aut.ac.nz> writes:
>
> > Dear Lispers,
> >
> > I wonder if anyone knows of open-source Lisp code to convert an HTML file
> > into a plain ASCII file.
> > Please recommend any simple code.
>
> If all you want is to strip out the HTML markup, you might try
> the following:
>
> Note: strip-xml-markup removes *all* markup, leaving only the
> text. Also, strip-xml-markup reads from a string and returns the
> stripped string; it should be trivial to rewrite it to operate on
> streams.
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> (defparameter *strip-xml-parse-table*
>   #(((#\< . 1) 0)                   ; 0 - normal state
>     ((#\! . 2) 5)                   ; 1 - after <
>     ((#\- . 3) 5)                   ; 2 - after <!
>     ((#\- . 4) 5)                   ; 3 - after <!-
>     ((#\- . 8) 4)                   ; 4 - comment <!--
>     ((#\' . 6)                      ; 5 - markup
>      (#\" . 7)
>      (#\> . 0)
>      5)
>     ((#\' . 5) 6)                   ; 6 - markup, single-quote
>     ((#\" . 5) 7)                   ; 7 - markup, double-quote
>     ((#\- . 9) 4)                   ; 8 - comment, after -
>     ((#\> . 0) (#\- . 9) 4)         ; 9 - comment, after --(*)
>     ))
>
> (defun lookup-transition (state input)
>   (let ((transitions (svref *strip-xml-parse-table* state)))
>     (dolist (elt transitions)
>       (cond ((atom elt)
>              (return-from lookup-transition elt))
>             ((char-equal input (car elt))
>              (return-from lookup-transition (cdr elt)))))
>     (error "State table error")))
>
> (defun strip-xml-markup (string)
>   (let ((state 0))
>     (with-output-to-string (out)
>       (with-input-from-string (in string)
>         (do ((char (read-char in nil) (read-char in nil)))
>             ((null char))
>           (let ((new-state (lookup-transition state char)))
>             (when (zerop new-state)
>               (write-char
>                (if (zerop state)
>                    char
>                    #\Space)
>                out))
>             (setf state new-state)))))))
>
> --
> Raymond Wiker
> ·············@fast.no

From: Johannes Beck
Subject: Re: HTML to Text
Date: Thu, 16 Aug 2001 19:11:33 +0000
Message-ID: <3B7C1AE5.3B0BF390@arcormail.de>

> > > I wonder if anyone knows of open-source Lisp code to convert an HTML file
> > > into a plain ASCII file.
> > > Please recommend any simple code.
> >
> > If all you want is to strip out the HTML markup, you might try
> > the following:
> >
> > Note: strip-xml-markup removes *all* markup, leaving only the
> > text. Also, strip-xml-markup reads from a string and returns the
> > stripped string; it should be trivial to rewrite it to operate on
> > streams.
You've forgotten an iron rule of c.l.l: never post complete solutions to
homework.
Joe

From: Tim Moore
Subject: Re: HTML to Text
Date: Fri, 17 Aug 2001 18:12:56 +0000
Message-ID: <9ljmr8$rke$0@216.39.145.192>

On Thu, 16 Aug 2001, Johannes Beck wrote:
> You've forgotten an iron rule of c.l.l: never post complete solutions to
> homework.
>
> Joe
>
How else would we amuse ourselves? :)
Tim