From: ······@gmail.com
Subject: Elisp Lesson on file processing (make downloadable copy of a website)
Date: 
Message-ID: <e042cdff-0c22-409c-be1a-e22e177e6f29@8g2000hse.googlegroups.com>
Another emacs lisp tutorial.

HTML version with links and syntax coloring is at:
 http://xahlee.org/emacs/make_download_copy.html

A plain text version follows.

-------------------

Elisp Lesson: Make Download Archive Of A Website

Xah Lee, 2008-07-04

This page shows a real-life example of using elisp to create a
downloadable archive of a website. If you don't know elisp, first take
a look at Emacs Lisp Basics.

-----------------------
The Problem

SUMMARY

Create a downloadable version of a subdirectory of a website, so that
all relative links in the downloadable version are changed into full
URLs starting with a domain name, and internal directories or files
are removed, etc.

Technically, this lesson gives an example of a text-processing
function that modifies a file, and this function is applied to
hundreds of files. It also shows how to call the OS's command-line
utilities to copy, delete, and zip directories, and how a function is
used to compute the correct link from a given file path and relative
link.

DETAIL

My website has many programming tutorials. For example, the tutorial
on emacs and emacs lisp you are reading, a Perl and Python Tutorial, a
Java Tutorial, and others. There are also math expositions such as The
Discontinuous Groups of Rotation and Translation in the Plane, and
several annotated classics such as Time Machine and Flatland. For
these projects, i want to create a downloadable copy in zip form, so
that people can download them for offline reading.

I could just make a copy of the directory, zip it, and let people
download that. However, this simple solution doesn't work well,
because these dirs link to css files in the parent directory. If i
don't include the css files, then styles such as source code syntax
coloring won't be in the downloaded version. Also, all the links in
these html files are relative links, many of which point outside of
the directory. Links that point outside the directory should be full
urls starting with “http://xahlee.org/”, so that they won't be broken
links in the downloaded version. Also, each html file contains a
javascript web bug for Google Analytics↗, for the purpose of gathering
web visitor statistics. These javascript lines need to be removed in
the downloadable copy.

Take this emacs tutorial for example. On my local file system, the
emacs tutorial is at the directory “/Users/xah/web/emacs/”, which
includes my tutorial on emacs and elisp. The downloadable archive is
planned to be at “/Users/xah/web/diklo/xah_emacs_tutorial.tar.gz”. It
should contain the emacs dir, but also the elisp dir at
“/Users/xah/web/elisp”, which is the elisp manual, so that any links
from my emacs tutorial to the lisp manual will still work in the
download copy.

It would be nice if i could just press a button in emacs and have this
archive generated automatically. The script would automatically copy
the directories or files needed, fix all the relative links, take out
unnecessary javascript in the files, remove any emacs backup files or
other garbage files such as Mac's auto-generated “.DS_Store” files,
then zip it. And whenever i have updated my emacs tutorial, i can run
the script again to regenerate a fresh downloadable version.

-----------------------
Solution

The general plan is simple:

    * 1. Copy the directories into a destination directory.
    * 2. Call shell commands to delete temp files, such as emacs's
backup files, in the destination dir.
    * 3. Have a function that processes each html file, to change
relative links and take out Google Analytics's javascript code.
    * 4. Call shell commands to archive this dir.

First, we define some user input parameters for the script:

;; web root dir
(setq webroot "/Users/xah/web/") ; must end in slash

;; list of source dirs i want to make an archive of.
;; Each is relative to webroot. Must not end in slash.
(setq sourceDirsList (list "emacs" "elisp"))

;; Destination dir path, relative to webroot.
;; This is the dir i want the archive to be at.
(setq destDirRelativePath "diklo")

;; dest zip archive name (without the “.zip” suffix)
;; for example here, the download file will be xah_emacs_tutorial.zip
(setq zipCoreName "xah_emacs_tutorial")

;; whether to use gzip or zip.
(setq use-gzip-p nil)

Then, we define some convenient constants.

(setq destRoot (concat webroot destDirRelativePath "/"))
(setq destDir (concat destRoot zipCoreName "/"))

So, destRoot would be like “/Users/xah/web/diklo/” and destDir would
be like “/Users/xah/web/diklo/xah_emacs_tutorial/”. The final download
archive would be “/Users/xah/web/diklo/xah_emacs_tutorial.tar.gz”.

Now, we copy the source dirs to destination.

;;; copy to destination
(mapc
 (lambda (x)
   (let (fromDir toDir)
     (setq fromDir (concat webroot x))
     (setq toDir
           (drop-last-slashed-substring
            (concat webroot destDirRelativePath "/" zipCoreName "/" x)))
     (make-directory toDir t)
     (shell-command (concat "cp -R " fromDir " " toDir))))
 sourceDirsList)

The above code uses the function “mapc”. The function has the form
“(mapc 'myfunc mylist)”, where the function myfunc is applied to each
element of mylist. The function we use here is “(lambda (x) ...)”,
with “x” being the argument. Inside the lambda, the source dir's and
dest dir's paths are generated, then “make-directory” is called; given
a full path, it creates that dir and any missing parent dirs. Then,
finally, we call “shell-command” to copy the dirs.
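
For example, here is a minimal sketch of the mapc form by itself, just
printing a message per element instead of copying, so it is safe to
evaluate:

;; a minimal sketch of the (mapc 'myfunc mylist) form.
;; mapc is used for side effects only; it returns the list unchanged.
(mapc
 (lambda (x) (message "would copy dir: %s" x))
 (list "emacs" "elisp"))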

Also, we called “drop-last-slashed-substring”, which is defined as
follows:

(defun drop-last-slashed-substring (path)
  "Drop the last “/”-separated component of PATH.
For example:
“/a/b/c/d” → “/a/b/c”
“/a/b/c/d/” → “/a/b/c/d”
“/” → “”
“//” → “/”
“” → “”"
  (if (string-match "\\(.*/\\)+" path)
      (substring path 0 (1- (match-end 0)))
    path))
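
To see what this does in the copy loop, here is a worked example using
the parameter values defined earlier and x being "emacs"; evaluating
it just returns a string:

;; with webroot = "/Users/xah/web/", destDirRelativePath = "diklo",
;; zipCoreName = "xah_emacs_tutorial", and x = "emacs":
(drop-last-slashed-substring
 (concat "/Users/xah/web/" "diklo" "/" "xah_emacs_tutorial" "/" "emacs"))
;; ⇒ "/Users/xah/web/diklo/xah_emacs_tutorial"
;; that is, the parent dir that “cp -R” will copy the emacs dir into.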

Copying a bunch of directories seems like a trivial operation, but it
actually took me a couple of hours to arrive at the final code, due to
the exact requirements on directory paths, and the unix “cp” tool's
lack of precision and flexibility.

Originally, i thought the code would be something simple like several
“(shell-command (concat "cp -R " fromDir " " toDir))” calls, one for
each source dir, where fromDir and toDir are full paths. However, it
turns out the problem is slightly more complex.

The unix “cp” command's behavior depends on whether the destination
dir already exists. Suppose the source dir is “/a/b/c/emacs” and i
want it copied to “/a/b/xahEmacsTut/emacs”. If you run
“cp -R /a/b/c/emacs /a/b/xahEmacsTut/emacs” and “/a/b/xahEmacsTut”
doesn't exist yet, cp fails with “No such file or directory”. If
“/a/b/xahEmacsTut/emacs” already exists, the result is
“/a/b/xahEmacsTut/emacs/emacs”, which is wrong. If you run
“cp -R /a/b/c/emacs /a/b/xahEmacsTut” and “/a/b/xahEmacsTut” doesn't
exist, cp creates it but copies only the contents of the emacs dir
into it, so the “emacs” node itself is lost. Only when
“/a/b/xahEmacsTut” already exists and is given as the destination do
you get the desired “/a/b/xahEmacsTut/emacs”. To work around this dumb
smartness of “cp”, we first create all the parent directories of the
destination dir, but without the destination dir itself, then call
“cp”.

Directories are abstractly a tree. Copying directories is like
grafting a branch from one tree onto another. The complexity arises
because we also have to consider what happens when the parent nodes of
the destination branch's spec do not already exist. When we think of
copying a dir mathematically, we can see it's not quite so simple.

The unix “cp” tool's lack of precision can be understood because it is
primarily a tool for manual operation. Typically, it is used to copy a
dir where all the dest dir's parents already exist. We simply “cd” to
the dest dir first.
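
Here is the workaround in a nutshell, as a minimal elisp sketch using
the hypothetical “/a/b/...” paths from the example above:

;; create the parents of the destination, but not the final “emacs” node,
;; then let cp create that node itself.
(make-directory "/a/b/xahEmacsTut" t)                 ; like “mkdir -p”
(shell-command "cp -R /a/b/c/emacs /a/b/xahEmacsTut")
;; result: /a/b/xahEmacsTut/emacs, with all branches of the original dir.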

Now, we copy my site's style sheets.

;; copy the style sheets and the icons dir over
(shell-command (concat "cp /Users/xah/web/lang.css " destDir))
(shell-command (concat "cp /Users/xah/web/lbasic.css " destDir))
(shell-command (concat "cp /Users/xah/web/lit.css " destDir))
(shell-command (concat "cp -R /Users/xah/web/ics " destDir))

Now, do some file cleanup.

; remove emacs backup files, temp files, mac os x files, etc.
(shell-command (concat "find " destDir " -name \"*~\" -exec rm {} \\;"))
(shell-command (concat "find " destDir " -name \"#*#\" -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -name \"xx*\" -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -name \"\\.DS_Store\" -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type f -empty -exec rm {} \\;"))
(shell-command (concat "find " destDir " -type d -empty -exec rmdir {} \\;"))
(shell-command (concat "find " destDir " -type d -name \"xx*\" -exec rm -R {} \\;"))

Now, we need to modify the relative links so that, if a link points to
a file that is not part of the downloadable copy, it is changed into a
“http://xahlee.org/...” based link.

For example, my emacs tutorial file
“/Users/xah/web/emacs/elisp_htmlize.html” contains the link
“<a href="../perl-python/index.html">Perl and Python tutorial</a>”,
which points to a file outside the emacs dir. When a user downloads my
emacs tutorial, this link will then point to a file that doesn't exist
on his disk. The link “../perl-python/index.html” should be changed to
“http://xahlee.org/perl-python/index.html”.

Also, my html files contain a javascript snippet for Google Analytics,
like this: “<script src="http://www.google-analytics.com/urchin.js"
type="text/javascript"></script><script type="text/javascript">
_uacct = "UA-104620-2"; urchinTracker();</script>”. This allows me to
see my web traffic statistics. The downloaded version shouldn't have
this line.

Here's the code to process each html file for the above problems:

;;; change local links to “http://” links.
;;; Delete the google javascript snippet, and other small fixes.
(setq make-backup-files nil)
(require 'find-lisp)
(mapc
 (lambda (x)
   (mapc
    (lambda (fpath)
      (clean-file fpath (concat webroot (substring fpath (length destDir)))))
    (find-lisp-find-files (concat destDir "/" x) "\\.html$")))
 sourceDirsList)
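
To see what the second argument passed to clean-file works out to,
here is a worked example using the destDir value from earlier and the
elisp_htmlize.html file mentioned above:

;; with destDir = "/Users/xah/web/diklo/xah_emacs_tutorial/"
;; and fpath = ".../xah_emacs_tutorial/emacs/elisp_htmlize.html":
(concat "/Users/xah/web/"
        (substring "/Users/xah/web/diklo/xah_emacs_tutorial/emacs/elisp_htmlize.html"
                   (length "/Users/xah/web/diklo/xah_emacs_tutorial/")))
;; ⇒ "/Users/xah/web/emacs/elisp_htmlize.html"
;; i.e. the path of the “same” file back in the original web structure.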

In the above code, we use “mapc” to apply a function to all the html
files. “find-lisp-find-files” generates a list of all files in a dir
whose names match a regexp. Here, we actually call mapc twice, one
inside the other.

The sourceDirsList is a list of dirs. So, the first mapc maps a
function over each of the dirs. Then, for each dir, we want to apply a
function to all its html files; that's what the inner mapc is for. The
function that actually processes the html file is “clean-file”. It
takes 2 arguments: the first is the full path to the html file to be
processed, the second is the full path to the “same” file in the
source dir (as computed in the example above). This is needed in order
to compute the correct url for any relative link that needs to be
fixed. Here's the code:

(defun clean-file (fpath originalFilePath)
  "Modify the HTML file at fpath, to make it ready for the download bundle.

This function changes local links to “http://” links,
deletes the google javascript snippet, and makes other small changes,
so that the file is nicer to view offline on a computer
without the entire xahlee.org web dir structure.

The google javascript is the Google Analytics web bug that tracks
web stats for xahlee.org.

fpath is the full path to the html file that will be processed.
originalFilePath is the full path to the “same” file in the original
web structure.
originalFilePath is used to construct new relative links."
  (let (mybuffer bds p1 p2 linkPath linkPathSansJumper)

    (setq mybuffer (find-file fpath))

    (goto-char (point-min)) ; in case buffer is already open
    (while (search-forward "<script src=\"http://www.google-analytics.com/urchin.js\" type=\"text/javascript\"></script><script type=\"text/javascript\"> _uacct = \"UA-104620-2\"; urchinTracker();</script>" nil t)
      (replace-match ""))

    (goto-char (point-min))
    (while (search-forward "<a href=\"http://xahlee.org/PageTwo_dir/more.html\">Xah Lee</a>" nil t)
      (replace-match "<a href=\"http://xahlee.org/PageTwo_dir/more.html\">Xah Lee↗</a>"))

    ;; go thru each link; if the link is local,
    ;; then check whether the file exists.
    ;; if not, replace the link with the proper http://xahlee.org/ url
    (goto-char (point-min)) ; in case buffer is already open

    (while (search-forward-regexp "<[[:blank:]]*a[[:blank:]]+href[[:blank:]]*=[[:blank:]]*" nil t)
      (forward-char 1)
      (setq bds (bounds-of-thing-at-point 'filename))
      (setq p1 (car bds))
      (setq p2 (cdr bds))
      (setq linkPath (buffer-substring-no-properties p1 p2))

      (when (not (string-match "^http://" linkPath))

        ;; get rid of trailing jumper, e.g. “Abstract-Display.html#top”
        (setq linkPathSansJumper
              (replace-regexp-in-string "^\\([^#]+\\)#.+" "\\1" linkPath t))

        (when (not (file-exists-p linkPathSansJumper))
          (delete-region p1 p2)
          (let (newLinkPath)
            (setq newLinkPath
                  (compute-url-from-relative-link originalFilePath linkPath webroot "xahlee.org"))
            (insert newLinkPath))
          (search-forward "</a>")
          (backward-char 4)
          (insert "↗"))))
    (save-buffer)
    (kill-buffer mybuffer)))

In the above function “clean-file”, the hard part is to construct the
correct URL for a relative link.

Given a file, there are many relative links. The link may or may not
be good in the download copy version. For example, if the relative
link does not start with “../”, then it is still good. However, if it
starts with “../”, it may or may not be still good. For example, in my
emacs tutorial project, both “/Users/xah/web/emacs/” and
“/Users/xah/web/elisp/” are part of the download archive. So, if some
file under the emacs dir has a relative link starting with “../elisp/”,
then it is still a good link. We don't want to replace that with a
“http://...” version. To compute the correct relative link, we
actually need to know the original dir structure.

Computing relative links is conceptually trivial. Basically, each
occurrence of “../” means one dir level up. But actually coding it
correctly took a while, due to various little issues. For example,
some links have a trailing jumper of this form:
“Abstract-Display.html#top”. The trailing “#top” needs to be removed
if we want to use the string to check whether the file exists.
Theoretically, all it takes to determine a relative link is the file
path of the file that contains the link, the relative link string, and
the dir tree structure surrounding the file. Specifically, when we
move a dir and wish to construct or fix relative links, we do not need
to check whether the linked file still exists in the new dir. In
practice, though, it's much simpler to first determine whether the
relative link is still good, by checking whether the linked file
exists in the new download copy's dir structure.
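
For example, stripping the trailing jumper is a one-liner with
“replace-regexp-in-string”; this is the same call used inside
clean-file above:

(replace-regexp-in-string "^\\([^#]+\\)#.+" "\\1" "Abstract-Display.html#top" t)
;; ⇒ "Abstract-Display.html"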

The clean-file function first grabs the relative link string from the
html file, then determines whether this link needs to be fixed, then
calls “compute-url-from-relative-link”, which returns the proper
“http://” based url. The function compute-url-from-relative-link takes
4 parameters: fPath, linkPath, webDocRoot, hostName. See the inline
doc below:

(defun compute-url-from-relative-link (fPath linkPath webDocRoot hostName)
  "Returns a “http://” based URL of a given linkPath,
based on its fPath, webDocRoot, hostName.

fPath is the full path to a html file.
linkPath is a string that's a relative path to another file,
from a “<a href=\"...\">” tag.
webDocRoot is the full path to a parent dir of fPath.
Returns a url of the form “http://hostName/‹urlPath›”
that points to the same file as linkPath.

For example, if
fPath is /Users/xah/web/Periodic_dosage_dir/t2/mirrored.html
linkPath is ../../p/demonic_males.html
webDocRoot is /Users/xah/web/
hostName is xahlee.org
then the result is http://xahlee.org/p/demonic_males.html

Note that webDocRoot may or may not end in a slash."
  (concat "http://" hostName "/"
          (substring
           (file-truename (concat (file-name-directory fPath) linkPath))
           (length (file-name-as-directory (directory-file-name webDocRoot))))))
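
As a quick check, evaluating it on the docstring's example gives the
expected url (assuming “/Users/xah/web/” exists and is not a symlink,
since file-truename resolves the “../” parts against the real file
system):

(compute-url-from-relative-link
 "/Users/xah/web/Periodic_dosage_dir/t2/mirrored.html"
 "../../p/demonic_males.html"
 "/Users/xah/web/"
 "xahlee.org")
;; ⇒ "http://xahlee.org/p/demonic_males.html"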

Finally, we zip up the dest dir.

;; zip the dir
(let (ff)
  (setq ff (concat webroot destDirRelativePath "/" zipCoreName ".zip"))
  (when (file-exists-p ff) (delete-file ff))
  (setq ff (concat webroot destDirRelativePath "/" zipCoreName ".tar.gz"))
  (when (file-exists-p ff) (delete-file ff)))

(setq default-directory (concat webroot destDirRelativePath "/"))

(when (equal
       0
       (if use-gzip-p
           (shell-command (concat "tar cfz " zipCoreName ".tar.gz " zipCoreName))
         (shell-command (concat "zip -r " zipCoreName ".zip " zipCoreName))))
  (shell-command (concat "rm -R " destDir)))

In the above code, we first delete the previous archive if it exists,
set default-directory so the shell commands run in the right dir,
create the tar.gz or zip archive, and finally, if archiving succeeded
(exit status 0), delete the no-longer-needed dest dir.

Now, all is done. With all the code above in a buffer, i can just eval-
buffer to generate my downloadable archive, or i can call the script
from the OS's command line, like “emacs --script make_download_copy.el”.
I decided to go one step further, by wrapping the whole script into a
function, like this:

(defun make-downloadable-copy
    (webroot sourceDirsList destDirRelativePath zipCoreName &optional use-gzip-p)
  "Make a copy of a web dir of XahLee.org for download.

This function depends on the structure of XahLee.org,
and is not useful in general.

• webroot is the website doc root dir. (must end in slash)
e.g. “/Users/xah/web/”

• sourceDirsList is a list of dir paths relative to webroot,
to be copied for download. Must not end in slash.
e.g. (list \"p/time_machine\")

• destDirRelativePath is the destination dir of the download.
It's a dir path, relative to webroot.
e.g. “diklo”

• zipCoreName is the downloadable archive name, without the suffix.
e.g. “time_machine”

• use-gzip-p determines whether to use gzip or zip for the final
archive. If non-nil, use gzip."
  (let (...)
  ;; all the code above here except functions.
  )
)

Here's how i call it:

(make-downloadable-copy
"/Users/xah/web/"
(list "emacs" "elisp")
 "diklo" "xah_emacs_tutorial" "gzip")

(make-downloadable-copy
"/Users/xah/web/"
(list "elisp")
 "diklo" "elisp_manual")

(make-downloadable-copy
"/Users/xah/web/"
(list "java-a-day")
 "diklo" "xah_java_tutorial" "gzip")

(make-downloadable-copy
"/Users/xah/web/"
(list "perl-python")
 "diklo" "xah_perl-python_tutorial" "gzip")

(make-downloadable-copy
"/Users/xah/web/"
(list "js")
 "diklo" "xah_dhtml_tutorial" "gzip")

;; ----------------------
;; math

(make-downloadable-copy
"/Users/xah/web/"
(list "Wallpaper_dir")
 "diklo" "wallpaper_groups")

;; ----------------------
;; literature

(make-downloadable-copy
"/Users/xah/web/"
(list "p/time_machine")
 "diklo" "time_machine")

(make-downloadable-copy
"/Users/xah/web/"
(list "flatland")
 "diklo" "flatland")

;; ...

The whole script is about 200 lines, here: make_download_copy.el.gz.

Emacs is super!


  Xah
∑ http://xahlee.org/

☄

From: ······@gmail.com
Subject: Re: Elisp Lesson on file processing (make downloadable copy of a website)
Date: 
Message-ID: <bb06ae49-4977-4769-99ef-d076c6976ace@c65g2000hsa.googlegroups.com>
This week i wrote an emacs program and tutorial that archives a
website for offline reading.
(See http://xahlee.org/emacs/make_download_copy.html )

In the process, i ran into a problem with the unix “cp” utility. I was
a unix admin for Solaris during 1998-2004. Even when i first learned
about cp, i noticed some peculiarity. But only today did i think it
through and write out exactly what's going on.

Here's an excerpt from my emacs tutorial above regarding the issue.
------------------

Copying a bunch of directories seems like a trivial operation, but it
actually took me a couple of hours to arrive at the final code, due to
the exact requirements on directory paths, and the unix “cp” tool's
lack of precision and flexibility.

Originally, i thought the code would be something simple like several
“(shell-command (concat "cp -R " fromDir " " toDir))” calls, one for
each source dir, where fromDir and toDir are full paths. However, it
turns out the problem is slightly more complex.

Suppose the source dir is “/Users/xah/web/emacs” and the dest dir is
“/Users/xah/web/diklo/xahtut/emacs”, and i want all branches under the
source dir copied into the dest dir.

If you do “cp -R /Users/xah/web/emacs /Users/xah/web/diklo/xahtut/emacs”
when neither xahtut nor xahtut/emacs exists, then this error results:
“cp: /Users/xah/web/diklo/xahtut/emacs: No such file or directory”.

If you do “cp -R /Users/xah/web/emacs /Users/xah/web/diklo/xahtut/emacs”
when xahtut/emacs exists, then you get
“/Users/xah/web/diklo/xahtut/emacs/emacs”, which is not what i want.

If you do “cp -R /Users/xah/web/emacs /Users/xah/web/diklo/xahtut”
when xahtut doesn't exist, it results in all branches of emacs landing
directly in xahtut, which is wrong.

Only when you do “cp -R /Users/xah/web/emacs /Users/xah/web/diklo/xahtut”
with xahtut already existing do you get the correct result: the new
dir “/Users/xah/web/diklo/xahtut/emacs” with all branches of the
original dir.

So, the solution is to first create all the parent dirs of the dest
dir, but without the dest dir node itself. Then, do a cp into the
first parent of the dest dir.

Directories are abstractly a tree. Copying directories is like
grafting a branch from one tree onto another. To begin, we are given
two specs: the source node and a destination node. The source node by
definition is an existing node (i.e. an existing dir or file),
otherwise the copying won't make sense. However, the destination spec
can be a node that doesn't exist, or whose parent doesn't exist, or
whose parents are missing several levels up. When the destination spec
has missing nodes, we can consider creating them as part of the
grafting process, or we can consider it an error.

The unix “cp” tool's behavior is mathematically inconsistent. When the
destination node exists, the source node becomes a new child of the
destination node. When the destination node does not exist (but its
parent exists), “cp” creates the node, and copies only the children of
the source node into it. When the destination node doesn't exist and
its parent doesn't exist either, then “cp” considers it an error.

---------------
Related readings:

• The Nature of the “Unix Philosophy”
http://xahlee.org/UnixResource_dir/writ/unix_phil.html

  Xah
∑ http://xahlee.org/

☄
From: Sashi
Subject: Re: Elisp Lesson on file processing (make downloadable copy of a website)
Date: 
Message-ID: <7095dc5e-3008-45b8-8665-e746bff84dd6@m36g2000hse.googlegroups.com>
On Jul 6, 4:05 am, ·······@gmail.com" <······@gmail.com> wrote:
> In this week i wrote a emacs program and tutorial that does archiving
> a website for offline reading.
> (See http://xahlee.org/emacs/make_download_copy.html )

Why not use wget or curl?
From: ······@gmail.com
Subject: Re: Elisp Lesson on file processing (make downloadable copy of a website)
Date: 
Message-ID: <0219b84b-d524-4b3e-9031-ac399be322da@m3g2000hsc.googlegroups.com>
Xah Lee wrote:
«
... emacs program and tutorial that does archiving a website for
offline reading. (See http://xahlee.org/emacs/make_download_copy.html
)
»

Sashi wrote:
«Why not use wget or curl?»

The Emacs lisp program makes an archive of parts of a website you own,
so that readers of your website can click download and read it offline.

For wget and curl, i have some tips here:
 http://xahlee.org/UnixResource_dir/unix_tips.html

  Xah
∑ http://xahlee.org/

☄