html to text

From: Dario Lah
Subject: html to text
Date: Thu, 22 Sep 2005 13:20:44 +0000
Message-ID: <pan.2005.09.22.13.20.41.205114@tis.hr>

Hi,
I hope someone will comment on the following code regarding memory 
consumption and speed. What can be improved, and how?

I'm using this on SBCL.

It strips HTML tags; for script and style tags it also removes 
their content.

Usage is:
(html-file-to-text "some.html")

(defun get-file-content (file)
  "Returns a string with the entire contents of the specified file."
  (with-open-file (input file :external-format :iso-8859-1)
    (let* ((bytes-to-read (file-length input))
           (buffer (make-string bytes-to-read)))
      (read-sequence buffer input)
      buffer)))

(defun remove-named-tags (html-page &rest html-tag-names)
  "Removes named tags from html page"
  (dolist (tag-name html-tag-names)
    (remove-named-tag html-page tag-name)))

(defun remove-named-tag (html-page html-tag-name)
  "Completely removes the named tag, including its content, from the html page."
  (let ((tag-start (search (concatenate 'string "<" html-tag-name)
                           html-page :test #'string-equal))
        (tag-end (search (concatenate 'string "</" html-tag-name)
                         html-page :test #'string-equal)))
    (if tag-start
        (progn
          (if tag-end
              (setf tag-end (+ tag-end (+ (length html-tag-name) 3))))
          (delete-if #'(lambda (item) item) html-page :start tag-start :end tag-end)
          (remove-named-tag html-page html-tag-name)))))

(defun remove-html-tags (html-page)
  "Removes everything between tag boundaries < and > from html page."
  (let ((tag-start (position #\< html-page))
        (tag-end (position #\> html-page)))
    (if (not tag-start)
        html-page
        (progn
          (delete-if #'(lambda (item) item) html-page :start tag-start :end (+ tag-end 1))
          (remove-html-tags html-page)))))

(defun html-to-text (html-page)
  "Removes STYLE and SCRIPT tags completely, along with their content; for all other tags, removes everything between < and > while preserving the text between tags."
  (remove-named-tags html-page "style" "script")
  (remove-html-tags html-page))

(defun html-file-to-text (file)
  "Loads file into memory and removes html tags"
  (html-to-text (get-file-content file)))

Best regards,
Dario

From: Edi Weitz
Subject: Re: html to text
Date: Fri, 23 Sep 2005 08:42:50 +0000
Message-ID: <ubr2kw4xh.fsf@agharta.de>

On Thu, 22 Sep 2005 15:20:44 +0200, Dario Lah <····@tis.hr> wrote:

> I hope someone will comment on the following code regarding memory
> consumption and speed. What can be improved, and how?

Have you really tested this with large HTML pages found "in the wild"?
I haven't, but I can see a couple of issues with your code.  I'll try
to pick out some of them:

- It's probably not a good idea to slurp the whole file into memory
  and then work on it.  You should consider using a stream-based
  approach where you read from an input stream and write to an output
  stream like a Unix "filter."

- The way you use DELETE-IF might work by accident but it shows a
  typical "newbie" mistake which is described in the old Lisp FAQ
  here:

    <http://www.faqs.org/faqs/lisp-faq/part3/section-4.html>

- Also, note that instead of #'(LAMBDA (ITEM) ITEM) you can just write
  #'IDENTITY.

- Are you sure REMOVE-HTML-TAGS will work correctly with arbitrary
  HTML comments?  I don't think it will.  Consider this one:

    <!-- Note: Don't use ">" below! -->

- Assuming all else works, it looks like your code always searches the
  whole page from the beginning for the next tag, so each pass rescans
  text that has already been processed.  This doesn't look very
  efficient.
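
To make the DELETE-IF and rescanning points concrete, here is a minimal
sketch of a single-pass tag stripper.  The name STRIP-TAGS is
hypothetical and not part of the posted code.  Instead of relying on
the side effect of a destructive function (whose return value would
otherwise have to be captured, as in (setf page (delete-if ...))), it
builds a fresh string, and it resumes each search at the current
position via the :START argument of POSITION.  It deliberately ignores
the comment problem shown above.

```lisp
(defun strip-tags (html)
  "Return a fresh string holding HTML with every <...> span removed.
Illustrative sketch only: comments containing > are not handled."
  (with-output-to-string (out)
    (let ((pos 0))
      (loop
        ;; Search only from POS onward, never from the beginning.
        (let ((tag-start (position #\< html :start pos)))
          ;; Copy the text before the next tag (or the rest of the string
          ;; when TAG-START is NIL).
          (write-string html out :start pos :end tag-start)
          (unless tag-start (return))
          (let ((tag-end (position #\> html :start tag-start)))
            ;; An unterminated tag swallows the rest of the input.
            (unless tag-end (return))
            (setf pos (1+ tag-end))))))))
```

For example, (strip-tags "a<b>c</b>d") returns "acd", and the argument
string is left unmodified.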

FWIW, some time ago I had to write a program that did something
similar to what you want to achieve and now that I've seen your
message I've put it online here:

  <http://weitz.de/html-extract/>.

It uses the stream-based approach I describe above and tries to be
liberal enough not to choke on malformed HTML.  Note that it was
originally written for CLISP.  Porting it to another CL implementation
is left as an exercise for the reader... :)
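
For readers who want to see the general shape of a stream-based filter,
the following is a minimal sketch.  It is not the code from
html-extract; all function names are hypothetical, and it assumes that
dropping tags and comments is enough (the content of SCRIPT and STYLE
elements is not treated specially).  It reads one character at a time
from an input stream and writes to an output stream, so the whole page
never has to be held in memory, and it copes with a ">" inside a
comment.

```lisp
(defun skip-tag (in)
  "Consume characters from IN up to and including the next >."
  (loop for char = (read-char in nil nil)
        while (and char (char/= char #\>))))

(defun skip-comment (in)
  "Consume characters from IN through the closing --> of a comment."
  (loop with dashes = 0
        for char = (read-char in nil nil)
        while char
        do (cond ((char= char #\-) (incf dashes))
                 ;; A > only ends the comment right after two or more dashes.
                 ((and (char= char #\>) (>= dashes 2)) (return))
                 (t (setf dashes 0)))))

(defun skip-markup (in)
  "Called just after < was read from IN.  Skip the tag or comment."
  (if (and (eql (peek-char nil in nil nil) #\!)
           (read-char in)
           (eql (peek-char nil in nil nil) #\-)
           (read-char in)
           (eql (peek-char nil in nil nil) #\-)
           (read-char in))
      (skip-comment in)   ; saw "<!--"
      (skip-tag in)))     ; ordinary tag or declaration such as <!DOCTYPE>

(defun strip-html-stream (in out)
  "Copy characters from IN to OUT, dropping tags and comments."
  (loop for char = (read-char in nil nil)
        while char
        do (if (char= char #\<)
               (skip-markup in)
               (write-char char out))))

(defun strip-html-string (html)
  "Convenience wrapper around STRIP-HTML-STREAM for strings."
  (with-input-from-string (in html)
    (with-output-to-string (out)
      (strip-html-stream in out))))
```

Used as a filter it would be called as
(strip-html-stream *standard-input* *standard-output*); the string
wrapper is only there to make it easy to try at the REPL.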

You might also want to look at these:

  <http://opensource.franz.com/xmlutils/index.html>
  <http://www.cliki.net/CL-HTML-Parse>
  <http://www.neilvandyke.org/htmlprag/htmlprag.html>

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")