From: Xah Lee
Subject: Emacs Lisp Lesson: Text Processing HTML
Date: 
Message-ID: <bb450250-01d5-49ce-83fe-78936998aea2@e23g2000prf.googlegroups.com>
Emacs Lisp Lesson: Text Processing HTML

(HTML version of this article with colors and links here:
http://xahlee.org/emacs/elisp_process_html.html
)

Xah Lee, 2007-11

This page shows a real-world example of using Emacs Lisp to process an
HTML file. If you don't know elisp, first take a gander at Emacs Lisp
Basics.

---------------------------------------------
THE PROBLEM

SUMMARY

I want to write an elisp program that processes an HTML file in a
somewhat complex way. Specifically, certain strings must be replaced
only if they appear inside a particular tag, and/or only if they are
the first child.

DETAIL

I have many web pages that have a Questions and Answers format. (The
HTML version of this article includes a sample screenshot.)

And here's an example of the raw HTML:

<p class="q">Q: why ...</p>
<p class="a">A: blab 1 ...</p>
<p class="a">blab 2 ...</p>
...
<p class="q">Q: how ...</p>
<p class="a">A: do this 1 ...</p>
<p class="a">do that 2 ...</p>
...

Basically, each Q section is a paragraph of class “q”, and each A
section is several <p> tags with class “a”.

After a few years with this format, i started to use a better one.
Specifically, the Answers section should just be wrapped in a single
“<div class="a"></div>”, and the “Q: ” and “A: ” texts are removed
from the content. Here's an example of the new format:

<p class="q">Why ...</p>
<div class="a">
<p>Because ...</p>
<p>You need to do ...</p>
...
</div>

The task i have now, is to transform existing pages to this new
format. Here's what needs to be done precisely:

For any consecutive blocks of “<p class="a">...</p>”, wrap them with a
“<div class="a">” and “</div>”, then replace those “<p class="a">” by
“<p>”. Also, remove those “Q: ” and “A: ”.

Although this is simple in principle, it is hard to code as described
without an HTML parser, and using an HTML parser has its own
problems: the HTML/DOM model would make the code much more complex,
and the output would change the placement of whitespace. Unless we are
doing XML transformation, an HTML/DOM parser is usually not what we
want. A text-based search-and-replace algorithm to achieve the above
is as follows:

For each occurrence of “<p class="q">”, do the following:

    * Add a “<div class="a">” right after “<p class="q">blab blab ...</p>”.
    * Add a “</div>” right before “<p class="q">”.
    * Replace “<p class="q">Q: ” by “<p class="q">”, and replace
“<p class="a">A: ” by “<p class="a">”.
Now do:

    * Remove the first occurrence of “</div>”, the one that comes before
the first occurrence of “<p class="q">”.
    * Add a “</div>” after the last “<p class="a">...</p>” tag.
    * Replace all “<p class="a">” by “<p>”.

We now proceed to write elisp code to solve this problem.

---------------------------------------------
SOLUTION

The algorithm described above is based on global text replacement.
However, since emacs represents a file as a buffer with a pointer (the
point) that can move back and forth, the algorithm can be slightly
simplified.

Suppose the file we want to work on is
elisp_process_html_sample.html.zip.

First, we write a prototype that just works on a single file,
“elisp_process_html_sample.html”. Here's the code:

(defun xx ()
  "temp test function"
  (interactive)
  (find-file "elisp_process_html_sample.html")
  ;; same effect as beginning-of-buffer, which is meant for interactive use
  (goto-char (point-min))

;; add opening and closing tags for answer section
;; this is done by locating the opening question tag,
;; then move to the first answer tag and insert <div class="a"> before it
;; then, locate the next opening question tag but move backward to </p>,
;; then insert </div>
  (while (search-forward "<p class=\"q\">" nil t)
    (search-forward "<p class=\"a\">")
    (replace-match "<div class=\"a\">\n<p class=\"a\">")
    (if (search-forward "<p class=\"q\">" nil t)
        (progn
          (search-backward "</p>")
          (forward-char 4)
          (insert "\n</div>")
          )
      )
    )

;; add the last closing tag for answer section
  ;; same effect as end-of-buffer
  (goto-char (point-max))
  (search-backward "<p class=\"a\">")
  (search-forward "</p>")
  (insert "\n</div>")

;; take out the “Q: ” and “A: ” and replace “<p class="a">” by “<p>”.
;; each pass starts again from the top of the buffer; otherwise it would
;; only see the text after the last match of the previous pass
  (goto-char (point-min))
  (while (search-forward "<p class=\"q\">Q: " nil t)
    (replace-match "<p class=\"q\">"))
  (goto-char (point-min))
  (while (search-forward "<p class=\"a\">A: " nil t)
    (replace-match "<p>"))
  (goto-char (point-min))
  (while (search-forward "<p class=\"a\">" nil t)
    (replace-match "<p>"))
)

This is simple code. It uses the power of emacs's buffer data structure
for files: we move a pointer (the point) back and forth to a desired
place, then search and replace text, or insert text. With the ability
to move the point to a particular string, we can locate the places
where we want the tag insertion to happen, without explicitly going by
the DOM model of parent-child relationship of tags.
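
To try these steps without touching any file, here is a minimal sketch
that runs them on a string in a temporary buffer. The function name
“my-demo-convert-qa” is made up for illustration, and the sketch assumes
the input is laid out like the sample above (every question followed by
at least one “<p class="a">” paragraph, no nesting).

```elisp
(defun my-demo-convert-qa (html)
  "Return HTML converted to the new Q and A format.
Hypothetical helper: it runs the same steps as the prototype, but
in a temporary buffer, so no file is visited or saved."
  (with-temp-buffer
    (insert html)
    ;; Wrap each run of answers: open the div before the first answer
    ;; that follows a question, close it before the next question.
    (goto-char (point-min))
    (while (search-forward "<p class=\"q\">" nil t)
      (when (search-forward "<p class=\"a\">" nil t)
        (replace-match "<div class=\"a\">\n<p class=\"a\">" t t)
        (when (search-forward "<p class=\"q\">" nil t)
          (search-backward "</p>")
          (forward-char 4)
          (insert "\n</div>"))))
    ;; Close the last answer section.
    (goto-char (point-max))
    (when (search-backward "<p class=\"a\">" nil t)
      (search-forward "</p>")
      (insert "\n</div>"))
    ;; Drop the "Q: " and "A: " prefixes and the class on answers.
    ;; Each pass restarts from the top of the buffer.
    (dolist (pair '(("<p class=\"q\">Q: " . "<p class=\"q\">")
                    ("<p class=\"a\">A: " . "<p>")
                    ("<p class=\"a\">" . "<p>")))
      (goto-char (point-min))
      (while (search-forward (car pair) nil t)
        (replace-match (cdr pair) t t)))
    (buffer-string)))
```

Calling it on a small string with two questions is a quick way to check
the tag placement before running anything on real pages.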

In the prototype above, the “search-forward” function moves the cursor
to the end of the matched text. If the text is not found, it signals an
error, unless its third argument is non-nil (as in the “while” tests),
in which case it returns “nil”. The “search-backward” function works
similarly, but puts the point at the beginning of the matched text.
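
Where the point ends up after each kind of search can be checked with a
small sketch like the following. The helper name
“my-demo-search-positions” is hypothetical, and the sample string is
just one line of the old format.

```elisp
(defun my-demo-search-positions ()
  "Return (AFTER-FORWARD AFTER-BACKWARD NOT-FOUND) for a sample buffer.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "<p class=\"q\">Q: why</p>\n")
    (goto-char (point-min))
    (let (after-forward after-backward not-found)
      ;; After a forward search, point is just after the match.
      (search-forward "</p>")
      (setq after-forward (list (point) (match-end 0)))
      ;; After a backward search, point is at the start of the match.
      (search-backward "<p")
      (setq after-backward (list (point) (match-beginning 0)))
      ;; With a non-nil third argument, a failed search returns nil
      ;; instead of signaling an error, and point does not move.
      (setq not-found (search-forward "<div>" nil t))
      (list after-forward after-backward not-found))))
```

In each of the first two pairs the two numbers are equal, and the last
element is nil because the sample contains no “<div>”.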

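The “replace-match” function replaces the text matched by the most
recent search. The “end-of-buffer” command moves the point to the end
of the buffer, and “beginning-of-buffer” moves it to the beginning. In
Lisp code, “(goto-char (point-max))” and “(goto-char (point-min))” do
the same without the side effects of the interactive commands, such as
setting the mark.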
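
One detail of “replace-match” worth knowing: its optional second and
third arguments (FIXEDCASE and LITERAL) control whether the replacement
has its letter case adjusted and whether backslash sequences such as
“\&” (the matched text) are interpreted. Here is a small sketch, with a
made-up helper name “my-demo-replace”, showing the effect of LITERAL:

```elisp
(defun my-demo-replace (literal)
  "Replace \"foo\" in a sample string; LITERAL is passed to `replace-match'.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "see foo here")
    (goto-char (point-min))
    (search-forward "foo")
    ;; Second argument t (FIXEDCASE): keep the replacement's case as written.
    ;; Third argument (LITERAL): when nil, \& stands for the matched text;
    ;; when non-nil, the replacement is inserted exactly as given.
    (replace-match "[\\&]" t literal)
    (buffer-string)))
```

When the replacement is plain HTML with no backslashes, as in the
prototype, both settings give the same text; passing “t t” simply makes
that explicit.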

Now, if we want to process many files, we first need to change the
code to take a file path, and add code to save and close the
buffer. (File backups are taken care of by emacs automatically.) Like
this:

(defun my-process-html (fpath)
  "a better doc string here..."
  (let (mybuffer)
    (setq mybuffer (find-file fpath))
    ; code body here
    (save-buffer)
    (kill-buffer mybuffer)
  )
)

To get the list of files containing the Q and A section, we can simply
use unix's “find” and “grep”, like this: “find . -name "*\.html" -exec
grep -l '<p class="q">' {} \;”
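
The same list can also be built from inside emacs. The following is a
sketch only: the function name “my-html-files-with-questions” is made
up, and it assumes Emacs 25.1 or later, where
“directory-files-recursively” is available.

```elisp
(defun my-html-files-with-questions (dir)
  "Return the .html files under DIR that contain <p class=\"q\">.
Hypothetical helper; assumes Emacs 25.1 or later for
`directory-files-recursively'."
  (let (result)
    (dolist (file (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        ;; Point stays at the start of the inserted text, so the
        ;; search below scans the whole file.
        (insert-file-contents file)
        (when (search-forward "<p class=\"q\">" nil t)
          (push file result))))
    (nreverse result)))
```

Its return value is a list of file names that can be handed to the
processing function, in place of pasting the output of “find”.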

Then, place the list of files into a list and loop over the list, like
this:

(mapcar (lambda (x) (my-process-html x))
        (list
"/Users/xah/web/ClassicalMusic_dir/q.html"
"/Users/xah/web/emacs/elisp_process_html.html"
"/Users/xah/web/emacs/elisp_process_html_sample.html"
"/Users/xah/web/emacs/emacs_adv_tips.html"
"/Users/xah/web/emacs/emacs_display_faq.html"
"/Users/xah/web/emacs/emacs_esoteric.html"
"/Users/xah/web/emacs/emacs_html.html"
"/Users/xah/web/emacs/emacs_n_unicode.html"
"/Users/xah/web/emacs/emacs_unix.html"
"/Users/xah/web/emacs/keyboard_shortcuts.html"
"/Users/xah/web/emacs/modernization.html"
"/Users/xah/web/img/imagemagic.html"
"/Users/xah/web/java-a-day/abstract_class.html"
"/Users/xah/web/sl/build_q.html"
"/Users/xah/web/sl/q.html"
"/Users/xah/web/UnixResource_dir/macosx.html"
"/Users/xah/web/UnixResource_dir/unix_tips.html"
"/Users/xah/web/UnixResource_dir/writ/mshatredfaq.html"
"/Users/xah/web/UnixResource_dir/writ/tabs_vs_spaces.html"
         )
)

The mapcar and lambda is a lisp idiom of looping thru a list. We
evaluate the code and we are all done!

Emacs is beautiful!

  Xah
  ···@xahlee.org
∑ http://xahlee.org/

From: Rainer Joswig
Subject: Re: Emacs Lisp Lesson: Text Processing HTML
Date: 
Message-ID: <joswig-E2EF6D.22153328112007@news-europe.giganews.com>
In article 
<····································@e23g2000prf.googlegroups.com>,
 Xah Lee <···@xahlee.org> wrote:

...

> (defun my-process-html (fpath)
>   "a better doc string here..."
>   (let (mybuffer)
>     (setq mybuffer (find-file fpath))
>     ; code body here
>     (save-buffer)
>     (kill-buffer mybuffer)
>   )
> )

Which one would format like this:

(defun my-process-html (fpath)
  "a better doc string here..."
  (let ((mybuffer (find-file fpath)))
    ; comment
    (save-buffer)
    (kill-buffer mybuffer)))

> 
> To get the list of files containing the Q and A section, we can simply
> use unix's “find” and “grep”, like this: “find . -name "*\.html" -exec
> grep -l '<p class="q">' {} \;”
> 
> Then, place the list of files into a list and loop over the list, like
> this:
> 
> (mapcar (lambda (x) (my-process-html x))
>         (list
> "/Users/xah/web/ClassicalMusic_dir/q.html"

...
>          )
> )
> 
> The mapcar and lambda is a lisp idiom of looping thru a list. We
> evaluate the code and we are all done!

Where the LAMBDA is not needed, since my-process-html is
already a function with one argument.

(mapcar 'my-process-html   ... )
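
A runnable sketch of passing the function by name follows. To keep it
free of file access, it uses a stand-in, “my-demo-process”, in place of
“my-process-html”; the stand-in and the variable “my-demo-seen” are
made up for illustration. It also uses “mapc”, which is like “mapcar”
but does not collect the return values, since here only the side effect
matters.

```elisp
(defvar my-demo-seen nil
  "Paths visited by `my-demo-process' (illustration only).")

(defun my-demo-process (fpath)
  "Stand-in for `my-process-html': record FPATH instead of editing it."
  (push fpath my-demo-seen))

;; The function is named directly; no lambda wrapper is needed.
;; `mapc' calls it on each element and discards the results.
(mapc #'my-demo-process '("a.html" "b.html"))
```

The “#'” prefix is the function-quote form of “'”; either works as the
first argument here.
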
From: Xah Lee
Subject: Re: Emacs Lisp Lesson: Text Processing HTML
Date: 
Message-ID: <8487b8b7-dfd1-49aa-af35-869e14849158@s12g2000prg.googlegroups.com>
Rainer Joswig wrote:
<<
Which one would format like this:
(defun ...
>>

See:
A Simple Lisp Code Formatter
http://xahlee.org/emacs/lisp_formatter.html

Rainer Joswig wrote:
<<Where the LAMBDA is not needed, since my-process-html is already a
function with one argument. (mapcar 'my-process-html ... )>>

you are right. Thanks.

  Xah
  ···@xahlee.org
∑ http://xahlee.org/