tutorial: generating sitemap with emacs lisp

From: ······@gmail.com
Subject: tutorial: generating sitemap with emacs lisp
Date: Thu, 03 Jul 2008 11:02:04 +0000
Message-ID: <2c22fea2-ab67-41c1-97de-21ee87e23ebd@s33g2000pri.googlegroups.com>

Here's a little tutorial on text processing with emacs lisp.

HTML version with syntax coloring and links is at:
 http://xahlee.org/emacs/make_sitemap.html

plain text version follows:

------------------------------------------
Elisp Lesson: Creating Sitemap

Xah Lee, 2008-07

This page shows how to use elisp to create a sitemap. If you don't
know elisp, first take a look at Emacs Lisp Basics.

--------------------------
THE PROBLEM

SUMMARY

I want to use elisp to create a sitemap. More specifically, generate a
list of all files in a given directory, including its subdirectories;
for each file, create a url string in a particular XML form; and put
the whole result into a file with proper header and footer text.

DETAIL

A sitemap is an XML file that lists urls of all files in a website for
web crawlers to crawl. If you are not familiar with it, see
http://en.wikipedia.org/wiki/Site_map and http://www.sitemaps.org/.

A sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

Each “<url>” element contains info about one url for the web crawler
to crawl. There can be a thousand or more “<url>” elements. The
“<loc>” is the URL of the file. The “<lastmod>”, “<changefreq>”, and
“<priority>” elements are optional.
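
As a rough sketch of the format described above, here is a small helper that builds one “<url>” element as a string. The name “my-sitemap-url-entry” is made up for illustration and is not used by the script later in this lesson. It does not escape special characters such as “&” in the url.

```elisp
;; Hypothetical helper for illustration; not part of the script below.
(defun my-sitemap-url-entry (loc &optional lastmod)
  "Return a sitemap <url> element for LOC as a string.
LASTMOD, if non-nil, is a date string such as \"2005-01-01\"."
  (concat "<url><loc>" loc "</loc>"
          ;; <lastmod> is optional, so emit it only when given.
          (if lastmod (concat "<lastmod>" lastmod "</lastmod>") "")
          "</url>\n"))

(my-sitemap-url-entry "http://www.example.com/" "2005-01-01")
```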

The purpose of a sitemap file is to let web crawlers know about all
the files that exist on your site, without having to find them by the
haphazard process of extracting links from pages they happen to know.
This makes crawling more efficient. Once a crawler knows all your
files, it can then decide which pages it actually wants to crawl for
content.

My website xahlee.org has close to 4000 html files. I want to use
elisp to generate a sitemap file.

Some of the files under my website's document dir are temp files not
meant for public access. The names of these files or dirs start with
“xx”. I don't want these included in the sitemap. Also, some files
whose content contains a particular string shouldn't be in the sitemap
either. So, my elisp program needs to be able to check the file's
name, and the file's content too.

--------------------------
SOLUTION

First, i define some parameters for the program.

;; full path to web's doc root. Must end in a slash.
(setq webroot "/Users/xah/web/")

;; file name of sitemap file, relative to webroot, without “.xml” suffix.
(setq sitemapFileName "sitemap")

;; gzip it or not. t for true, nil for false.
(setq gzip-it-p t)

I plan to generate a fresh sitemap regularly since my website has a
few new files each week. So, if a sitemap file already exists, i want
to back it up and generate a new one. Here's the code:

; rename file to backup ~ if already exist
(let (f1 f2)
  (setq f1 (concat webroot sitemapFileName ".xml"))
  (setq f2 (concat f1 ".gz"))
  (when (file-exists-p f1)
    (rename-file f1 (concat f1 "~") t)
    )
  (when (file-exists-p f2)
    (rename-file f2 (concat f2 "~") t)
    )
)

Note that the “rename-file” function takes an optional 3rd argument.
If true, it overwrites any existing file at the new name instead of
signaling an error.
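
To see the effect of that 3rd argument in isolation, here is a small sketch that works in a throwaway temp directory. The function name “my-demo-backup” and the file contents are made up for illustration.

```elisp
;; Hypothetical demo; uses a temp directory, not the real webroot.
(defun my-demo-backup ()
  "Rename a file onto an existing backup and report what exists afterwards."
  (let* ((dir (make-temp-file "sitemap-demo" t)) ; t means create a directory
         (f1 (expand-file-name "sitemap.xml" dir))
         (backup (concat f1 "~"))
         result)
    (write-region "old" nil backup) ; a stale backup already exists
    (write-region "new" nil f1)
    ;; With the 3rd argument nil, this call would signal an error
    ;; because the backup file already exists.
    (rename-file f1 backup t)
    (setq result (list (file-exists-p f1) (file-exists-p backup)))
    (delete-directory dir t) ; clean up
    result))

(my-demo-backup) ; returns (nil t)
```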

The next step is to open a buffer sitemapBuf, insert the sitemap
header tags, then, for each file in my web dir, insert its url into
sitemapBuf, then add the ending tag and save. Here's the code:

;; filePath is the full path to the sitemap file
;; sitemapBuf is the buffer of the sitemap file

(let (filePath sitemapBuf)
  (setq filePath (concat webroot sitemapFileName ".xml"))
  (setq sitemapBuf (find-file filePath))

;; insert header tags
  (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
")

;; for each file in my site, insert its url
  (require 'find-lisp)
  (mapc
   (lambda (x) (my-process-file x sitemapBuf))
   (find-lisp-find-files webroot "\\.html$"))

;; insert ending tag
  (insert "</urlset>")

;; some post processing to add some optional tags
  (goto-char 0)
  (search-forward "http://xahlee.org/Periodic_dosage_dir/pd.html</loc>")
  (insert "<changefreq>daily</changefreq>")

  (save-buffer)

;; gzip it
  (when gzip-it-p
    (shell-command (concat "gzip " filePath))
    )
)

In the above, first we generate the full path to the sitemap file to
be created. The full path is saved as a string in “filePath”. Then we
open the file, effectively creating a new buffer. The buffer object is
saved in the variable sitemapBuf. (Note: “buffer” is an elisp
datatype, or an instance of that datatype. Normally when we say
“buffer”, we actually mean the buffer's content.)

The interesting part in the above code is the directory traversal
section. The “find-lisp-find-files” line returns a list of the full
paths of all html files. The “mapc” applies a function to each element
of the list. The lambda line is the function that will be applied to
each full path.
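
Here is a small self-contained sketch of the same “find-lisp-find-files” plus “mapc” pattern, run on a throwaway directory tree instead of a real website. The function name “my-demo-list-html” and the file names are made up for illustration.

```elisp
(require 'find-lisp)

;; Hypothetical demo; builds a tiny tree in a temp directory.
(defun my-demo-list-html ()
  "Create a small temporary tree and return the .html files found in it."
  (let* ((dir (file-name-as-directory (make-temp-file "webroot-demo" t)))
         (sub (expand-file-name "emacs" dir))
         (found '()))
    (make-directory sub)
    (write-region "" nil (expand-file-name "index.html" dir))
    (write-region "" nil (expand-file-name "emacs.html" sub))
    (write-region "" nil (expand-file-name "notes.txt" sub))
    ;; mapc calls the lambda once per full path, for side effects only.
    (mapc (lambda (x) (push (file-relative-name x dir) found))
          (find-lisp-find-files dir "\\.html$"))
    (delete-directory dir t) ; clean up
    (sort found #'string<)))

(my-demo-list-html) ; returns ("emacs/emacs.html" "index.html")
```

The “.txt” file is left out because its name does not match the regex, and the file in the subdirectory is found because the search is recursive.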

So, for example, if an element is “/Users/xah/web/emacs/emacs.html”,
then the lambda function will get that as its argument, and execute
“(my-process-file "/Users/xah/web/emacs/emacs.html" sitemapBuf)”.

“my-process-file” is a function that takes a file's full path and a
buffer. It opens the file and checks whether the file should be added
to the sitemap. If so, it adds the file's url to the sitemapBuf
buffer.

my-process-file is defined this way:

(defun my-process-file (fpath destBuff)
  "process the file at fullpath fpath.
Write result to buffer destBuff."
  (let (fBuf)

    ;; show to user what the program is currently doing.
    ;; "%s" keeps a “%” in the file name from being read as a format spec.
    (message "%s" fpath)
    (when (not (string-match "/xx" fpath)) ; skip dir/file starting with xx
      (setq fBuf (find-file fpath)) ; open file
      (goto-char (point-min))
      (when (not (search-forward "<meta http-equiv=\"refresh\"" nil "noerror"))
        (with-current-buffer destBuff ; insert url to sitemap buffer
          (insert "<url><loc>")
          (insert (concat "http://xahlee.org/" (substring fpath (length webroot))))
          (insert "</loc></url>\n")
      ))
      (kill-buffer fBuf) ; close file
      )))

First it checks whether the file path contains “/xx”. On my website,
files and dirs whose names start with “xx” are temp files. So, if a
file or dir name starts with “xx”, skip it.
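
The name check can be tried on its own. Below is a minimal sketch that wraps it in a predicate; the name “my-private-path-p” and the sample paths are made up for illustration.

```elisp
;; Hypothetical predicate wrapping the same "/xx" test used above.
(defun my-private-path-p (fpath)
  "Return t if FPATH has a file or directory name starting with xx."
  ;; "/xx" matches only where xx directly follows a slash, that is,
  ;; at the start of a file or directory name.  With the default
  ;; value of `case-fold-search', the match ignores letter case.
  (and (string-match "/xx" fpath) t))

(my-private-path-p "/Users/xah/web/xx_draft.html")       ; t
(my-private-path-p "/Users/xah/web/xxtemp/a.html")       ; t
(my-private-path-p "/Users/xah/web/emacs/emacs.html")    ; nil
(my-private-path-p "/Users/xah/web/boxx.html")           ; nil
```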

Otherwise, open the file and check whether it contains an html meta
redirect tag. Google's webmaster guide says Google doesn't like urls
in a sitemap that point to a file that redirects with an html meta
tag. So, if the html file is a redirect, then don't generate a sitemap
url for it.
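
As an aside, the content check can also be done without visiting the file. The sketch below is a variation, not what the script in this lesson does: it reads the file into a temporary buffer with “insert-file-contents”. The function name “my-file-contains-p” is made up for illustration.

```elisp
;; Hypothetical alternative to find-file + kill-buffer for the content check.
(defun my-file-contains-p (fpath str)
  "Return t if the file at FPATH contains the literal string STR."
  (with-temp-buffer
    (insert-file-contents fpath)
    (goto-char (point-min))
    ;; The 3rd argument t makes search-forward return nil on failure
    ;; instead of signaling an error.
    (and (search-forward str nil t) t)))
```

It would be called as “(my-file-contains-p fpath "<meta http-equiv=\"refresh\"")”. The temporary buffer goes away by itself, so there is no buffer to kill afterwards.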

Finally, the code calls “(with-current-buffer destBuff ...)” to insert
the proper url tag into the sitemap buffer.
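
The url itself comes from chopping the webroot prefix off the full path and putting the site address in front. Here is that step as a standalone sketch; the function name “my-path-to-url” and its extra parameters are made up for illustration (the script hardcodes the site address and uses the global webroot).

```elisp
;; Hypothetical helper showing the path-to-url step by itself.
(defun my-path-to-url (fpath root base-url)
  "Turn full path FPATH under directory ROOT into a url under BASE-URL.
ROOT must end in a slash, and FPATH must start with ROOT."
  ;; (substring fpath (length root)) drops the ROOT prefix,
  ;; leaving the path relative to the web root.
  (concat base-url (substring fpath (length root))))

(my-path-to-url "/Users/xah/web/emacs/emacs.html"
                "/Users/xah/web/"
                "http://xahlee.org/")
;; returns "http://xahlee.org/emacs/emacs.html"
```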

The function “(with-current-buffer ‹buffer› ‹code›)” will temporarily
make the “‹buffer›” the current buffer and execute “‹code›”. When the
execution of “‹code›” is done, the current buffer returns to whatever
it was.
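
A small sketch of that behavior, using two scratch buffers. The function name “my-demo-with-current-buffer” and the buffer name are made up for illustration.

```elisp
;; Hypothetical demo of with-current-buffer switching and restoring.
(defun my-demo-with-current-buffer ()
  "Insert text into a second buffer and report which buffer is current."
  (let ((dest (generate-new-buffer "*sitemap-demo*"))
        result)
    (with-temp-buffer
      (let ((before (current-buffer)))
        (with-current-buffer dest
          (insert "<url><loc>http://www.example.com/</loc></url>\n"))
        ;; We are back in the temporary buffer; the text went to dest.
        (setq result
              (list (eq (current-buffer) before)
                    (buffer-size)
                    (with-current-buffer dest (buffer-string))))))
    (kill-buffer dest)
    result))

(my-demo-with-current-buffer)
;; returns (t 0 "<url><loc>http://www.example.com/</loc></url>\n")
```

The first element says the current buffer was restored, the second says nothing was inserted into it, and the third is the text that landed in the destination buffer.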

Once we are done with inserting a url into the sitemap, we close the
opened html file by the “kill-buffer” function.

The complete code put together is this:

;; 2008-07-02
;; sitemap_generator.el
;; this script generates the sitemap file for xahlee.org
;; see http://en.wikipedia.org/wiki/Site_map
;; http://www.sitemaps.org/

;;;--------------------------------------------------
;;; parameters

;; full path to web's doc root. Must end in a slash.
(setq webroot "/Users/xah/web/")

;; file name of sitemap file, relative to webroot, without “.xml” suffix.
(setq sitemapFileName "sitemap")

;; gzip it or not. t for true, nil for false.
(setq gzip-it-p t)

;;;--------------------------------------------------

(defun my-process-file (fpath destBuff)
  "process the file at fullpath fpath.
Write result to buffer destBuff."
  (let (fBuf)

    ;; "%s" keeps a “%” in the file name from being read as a format spec.
    (message "%s" fpath)
    (when (not (string-match "/xx" fpath)) ; dir/file starting with xx are not public
      (setq fBuf (find-file fpath))
      (goto-char (point-min))
      (when (not (search-forward "<meta http-equiv=\"refresh\"" nil "noerror"))
        (with-current-buffer destBuff
          (insert "<url><loc>")
          (insert (concat "http://xahlee.org/" (substring fpath (length webroot))))
          (insert "</loc></url>\n")
      ))
      (kill-buffer fBuf)
      )))

;;;--------------------------------------------------
;;; main

; rename file to backup ~ if already exist
(let (f1 f2)
  (setq f1 (concat webroot sitemapFileName ".xml"))
  (setq f2 (concat f1 ".gz"))
  (when (file-exists-p f1)
    (rename-file f1 (concat f1 "~") t)
    )
  (when (file-exists-p f2)
    (rename-file f2 (concat f2 "~") t)
    )
)

(let (filePath sitemapBuf)
  (setq filePath (concat webroot sitemapFileName ".xml"))
  (setq sitemapBuf (find-file filePath))
  (erase-buffer)
  (insert "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
")

  (require 'find-lisp)
  (mapc
   (lambda (x) (my-process-file x sitemapBuf))
   (find-lisp-find-files webroot "\\.html$"))

  (insert "</urlset>")

  (goto-char 0)
  (search-forward "http://xahlee.org/Periodic_dosage_dir/pd.html</loc>")
  (insert "<changefreq>daily</changefreq>")
  (save-buffer)

  (when gzip-it-p
    (shell-command (concat "gzip " filePath))
    )
)

You can run it either in a buffer by calling “eval-buffer”, or from
the OS command line with “emacs --script sitemap_generator.el”.

Emacs is super!

  Xah
∑ http://xahlee.org/

☄