Emacs Lisp Lesson: Text Processing HTML
(HTML version of this article with colors and links here:
http://xahlee.org/emacs/elisp_process_html.html
)
Xah Lee, 2007-11
This page shows a real-world example of using Emacs Lisp to process an
HTML file. If you don't know elisp, first take a gander at Emacs Lisp
Basics.
---------------------------------------------
THE PROBLEM
SUMMARY
I want to write an elisp program that processes an HTML file in a
somewhat complex way. Specifically, certain strings must be replaced
only if they appear inside a particular tag, and/or only if they are
the first child.
DETAIL
I have many web pages that have a Questions and Answers format. Here's
a sample screenshot: Website Question and Answers screenshot.
And here's an example of the raw HTML:
<p class="q">Q: why ...</p>
<p class="a">A: blab 1 ...</p>
<p class="a">blab 2 ...</p>
...
<p class="q">Q: how ...</p>
<p class="a">A: do this 1 ...</p>
<p class="a">do that 2 ...</p>
...
Basically, each Q section is a paragraph of class “q”, and each A
section is several <p> tags with class “a”.
After a few years with this format, I started to use a better format.
Specifically, the Answers section should just be wrapped in a single
“<div class="a"></div>”, and the “Q: ” and “A: ” texts are removed
from the content. Here's an example of the new format:
<p class="q">Why ...</p>
<div class="a">
<p>Because ...</p>
<p>You need to do ...</p>
...
</div>
The task I have now is to transform existing pages to this new
format. Here's what needs to be done precisely:
For any consecutive block of “<p class="a">...</p>” paragraphs, wrap it
with “<div class="a">” and “</div>”, then replace each “<p class="a">”
by “<p>”. Also, remove the “Q: ” and “A: ” prefixes.
Although this is simple in principle, it's hard to code as described
without using an HTML parser. Using an HTML parser has its own
problems: the HTML/DOM model would make the code much more complex,
and the output would change the placement of whitespace. Unless we are
doing XML transformation, an HTML/DOM parser is usually not what we
want. A text-based search-and-replace algorithm to achieve the above
is as follows:
For each occurrence of “<p class="q">”, do the following:
* Add a “<div class="a">” right after “<p class="q">blab blab ...</p>”.
* Add a “</div>” right before “<p class="q">”.
* Replace “<p class="q">Q: ” by “<p class="q">”, and replace
“<p class="a">A: ” by “<p class="a">”.
Now do:
* Remove the first occurrence of “</div>”, the one that falls before
the first occurrence of “<p class="q">”.
* Add a “</div>” after the last “<p class="a">...</p>” tag.
* Replace all “<p class="a">” by “<p>”.
We proceed to write elisp code to solve this problem.
---------------------------------------------
SOLUTION
The algorithm described above is based on global text replacement.
However, since Emacs represents a file as a buffer with a pointer (the
point) that can move back and forth, the algorithm is slightly
simplified.
Suppose the file we want to work on is
elisp_process_html_sample.html.zip.
First, we write a prototype that works on just the single file
“elisp_process_html_sample.html”. Here's the code:
(defun xx ()
  "Temporary test function: convert the sample Q and A file in place."
  (interactive)
  (find-file "elisp_process_html_sample.html")
  (beginning-of-buffer)
  ;; add opening and closing tags for answer section
  ;; this is done by locating the opening question tag,
  ;; then moving to the first answer tag after it and inserting
  ;; <div class="a"> in front of it,
  ;; then locating the next opening question tag, moving backward
  ;; to </p>, and inserting </div> after it
  (while (search-forward "<p class=\"q\">" nil t)
    (search-forward "<p class=\"a\">")
    (replace-match "<div class=\"a\">\n<p class=\"a\">" t t)
    (if (search-forward "<p class=\"q\">" nil t)
        (progn
          (search-backward "</p>")
          (forward-char 4)
          (insert "\n</div>")
          )
      )
    )
  ;; add the last closing tag for answer section
  (end-of-buffer)
  (search-backward "<p class=\"a\">")
  (search-forward "</p>")
  (insert "\n</div>")
  ;; take out the “Q: ” and “A: ”
  ;; each loop starts from the top, because a failed search leaves
  ;; point where the previous match ended
  (beginning-of-buffer)
  (while (search-forward "<p class=\"q\">Q: " nil t)
    (replace-match "<p class=\"q\">" t t))
  (beginning-of-buffer)
  (while (search-forward "<p class=\"a\">A: " nil t)
    (replace-match "<p>" t t))
  ;; replace the remaining “<p class="a">” by “<p>”
  (beginning-of-buffer)
  (while (search-forward "<p class=\"a\">" nil t)
    (replace-match "<p>" t t))
  )
This is simple code. It uses the power of Emacs's buffer data
structure for files: we move a pointer (the point) back and forth to a
desired place, then search and replace or insert text. With the
ability to move point to a particular string, we can locate the places
where we want the tag insertion to happen, without explicitly going
through the DOM model of parent-child relationships between tags.
In the above code, the “search-forward” function moves the cursor to
the end of the matched text. When its third argument is t, as in the
“while” tests above, it returns “nil” if nothing is found; otherwise a
failed search signals an error. “search-backward” works similarly, but
puts the point at the beginning of the matched text.
“replace-match” just replaces the previously matched text.
“end-of-buffer” moves the point to the end of the buffer, and
“beginning-of-buffer” moves it to the beginning.
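To see these point movements concretely, here is a small sketch that runs the same searches in a temporary buffer. The function name my-search-demo and the sample text are made up for illustration; they are not part of the program above.

```elisp
(defun my-search-demo ()
  "Return a list showing where point lands after each kind of search.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "<p class=\"a\">A: one</p>\n<p class=\"a\">two</p>")
    (let (after-forward after-backward not-found)
      ;; search-forward leaves point at the end of the match.
      (goto-char (point-min))
      (search-forward "<p class=\"a\">")
      (setq after-forward (point))
      ;; search-backward leaves point at the beginning of the match.
      (goto-char (point-max))
      (search-backward "<p class=\"a\">")
      (setq after-backward (point))
      ;; With t as the third argument, a failed search returns nil
      ;; instead of signaling an error, and point does not move.
      (setq not-found (search-forward "<p class=\"q\">" nil t))
      ;; replace-match replaces the text matched by the last search.
      (goto-char (point-min))
      (search-forward "<p class=\"a\">A: ")
      (replace-match "<p>" t t)
      (list after-forward after-backward not-found (buffer-string)))))
```

Evaluating (my-search-demo) returns the two positions, the nil from the failed search, and the buffer text after the replacement. The two t arguments to replace-match ask it to insert the replacement exactly as written, with no case conversion and no special treatment of backslashes.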
Now, if we want to process many files, we first need to change the
code to take a file path, and add code to save and close the buffer.
(File backups are all taken care of by Emacs automatically.) Like
this:
(defun my-process-html (fpath)
  "a better doc string here..."
  (let (mybuffer)
    (setq mybuffer (find-file fpath))
    ; code body here
    (save-buffer)
    (kill-buffer mybuffer)
    )
  )
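As an aside, a different way to process a file from Lisp is to work in a temporary buffer rather than visiting the file. The sketch below is not the approach used in this article. The name my-process-file-with and its TRANSFORM argument are hypothetical.

```elisp
(defun my-process-file-with (fpath transform)
  "Apply TRANSFORM to the text of the file at FPATH, then write it back.
TRANSFORM is called with no arguments, in a temporary buffer that
holds the file's text, with point at the beginning.
Hypothetical helper, not from the article."
  (with-temp-buffer
    (insert-file-contents fpath)
    (goto-char (point-min))
    (funcall transform)
    ;; Unlike save-buffer, write-region makes no backup file.
    (write-region (point-min) (point-max) fpath)))
```

A call might look like (my-process-file-with "elisp_process_html_sample.html" 'my-transform), where my-transform stands for a function holding the search-and-replace code. The trade-off is that no buffer is left open and no major mode is loaded, but there is also no automatic backup, and files with non-ASCII text may need some care with coding systems.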
To get the list of files containing the Q and A section, we can simply
use Unix's “find” and “grep”, like this: “find . -name "*\.html" -exec
grep -l '<p class="q">' {} \;”
Then, put the file paths into a list and loop over it, like
this:
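The same list can also be gathered from inside Emacs. This is only a sketch: the function name my-files-containing is made up, and it assumes Emacs 25 or later, where directory-files-recursively is available.

```elisp
(defun my-files-containing (dir text)
  "Return the .html files under DIR whose contents include TEXT.
Hypothetical helper; assumes Emacs 25 or later for
`directory-files-recursively'."
  (let ((case-fold-search nil) ; match case exactly, like grep -l
        result)
    (dolist (f (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        (insert-file-contents f)
        (goto-char (point-min))
        (when (search-forward text nil t)
          (push f result))))
    (nreverse result)))
```

For example, (my-files-containing "." "<p class=\"q\">") would return a list of file paths that can be passed straight to the loop that processes each file.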
(mapcar (lambda (x) (my-process-html x))
(list
"/Users/xah/web/ClassicalMusic_dir/q.html"
"/Users/xah/web/emacs/elisp_process_html.html"
"/Users/xah/web/emacs/elisp_process_html_sample.html"
"/Users/xah/web/emacs/emacs_adv_tips.html"
"/Users/xah/web/emacs/emacs_display_faq.html"
"/Users/xah/web/emacs/emacs_esoteric.html"
"/Users/xah/web/emacs/emacs_html.html"
"/Users/xah/web/emacs/emacs_n_unicode.html"
"/Users/xah/web/emacs/emacs_unix.html"
"/Users/xah/web/emacs/keyboard_shortcuts.html"
"/Users/xah/web/emacs/modernization.html"
"/Users/xah/web/img/imagemagic.html"
"/Users/xah/web/java-a-day/abstract_class.html"
"/Users/xah/web/sl/build_q.html"
"/Users/xah/web/sl/q.html"
"/Users/xah/web/UnixResource_dir/macosx.html"
"/Users/xah/web/UnixResource_dir/unix_tips.html"
"/Users/xah/web/UnixResource_dir/writ/mshatredfaq.html"
"/Users/xah/web/UnixResource_dir/writ/tabs_vs_spaces.html"
)
)
The combination of mapcar and lambda is a Lisp idiom for looping
through a list. We evaluate the code and we are all done!
Emacs is beautiful!
Xah
···@xahlee.org
∑ http://xahlee.org/
In article
<····································@e23g2000prf.googlegroups.com>,
Xah Lee <···@xahlee.org> wrote:
...
> (defun my-process-html (fpath)
> "a better doc string here..."
> (let (mybuffer)
> (setq mybuffer (find-file fpath))
> ; code body here
> (save-buffer)
> (kill-buffer mybuffer)
> )
> )
Which one would format like this:
(defun my-process-html (fpath)
  "a better doc string here..."
  (let ((mybuffer (find-file fpath)))
    ; comment
    (save-buffer)
    (kill-buffer mybuffer)))
>
> To get the list of files containing the Q and A section, we can simply
> use unix's “find” and “grep”, like this: “find . -name "*\.html" -exec
> grep -l '<p class="q">' {} \;”
>
> Then, place the list of files into a list and loop over the list, like
> this:
>
> (mapcar (lambda (x) (my-process-html x))
> (list
> "/Users/xah/web/ClassicalMusic_dir/q.html"
...
> )
> )
>
> The mapcar and lambda is a lisp idiom of looping thru a list. We
> evaluate the code and we are all done!
Where the LAMBDA is not needed, since my-process-html is
already a function with one argument.
(mapcar 'my-process-html ... )
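To illustrate the point with something that can be evaluated on its own, here is a small sketch. The function my-double is a made-up stand-in for my-process-html, chosen only because its results are easy to check.

```elisp
(defun my-double (n)
  "Return N times two.  Hypothetical stand-in for `my-process-html'."
  (* 2 n))

;; These three calls return the same list.
(mapcar (lambda (x) (my-double x)) '(1 2 3))
(mapcar 'my-double '(1 2 3))
(mapcar #'my-double '(1 2 3))

;; When only the side effects matter, as with file processing,
;; mapc does the same loop without building a result list.
(mapc #'my-double '(1 2 3))
```

The #' form is the function-quote spelling; it behaves like the plain quote here, and also tells the byte compiler that the symbol names a function.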
Rainer Joswig wrote:
<<
Which one would format like this:
(defun ...
>>
See:
A Simple Lisp Code Formatter
http://xahlee.org/emacs/lisp_formatter.html
Rainer Joswig wrote:
<<Where the LAMBDA is not needed, since my-process-html is already a
function with one argument. (mapcar 'my-process-html ... )>>
you are right. Thanks.
Xah
···@xahlee.org
∑ http://xahlee.org/