From: Xah Lee
Subject: Lisp Lesson: Regex Replace with a Function
Date: 
Message-ID: <1186713692.857024.306710@x40g2000prg.googlegroups.com>
Lisp Lesson: Regex Replace with a Function
Xah Lee, 2007-08

(An HTML-formatted version of this article with links is available at
http://xahlee.org/emacs/lisp_regex_replace_func.html
)

This page is an emacs lisp lesson that solves a real-world problem. It
shows how elisp is used in a text pattern replacement, using a function
that transforms the matched text. You should be familiar with Elisp
Language Basics.

Normally, a programmer can write a perl or python script to do a
find-replace operation on all files in a dir. (See Perl-Python: Find and
Replace By Regex Text Patterns) However, this process is not
interactive. If you want to decide each replacement on a case-by-case
basis, this approach won't work. If you are going to program
interactivity into your script, then it ceases to be a trivial
job. Emacs comes to the rescue, by having a built-in feature for
Interactively Find and Replace String Patterns on Multiple Files.

However, suppose you want the replacement string to be a transformed
version of the matched text. This means, instead of constructing the
replacement string using “\1”, “\2” etc, you need to use a function
that returns the text for the replacement. This page shows you how,
using a concrete example.

As of today, i have a website with 3319 HTML files. Together they
contain 3276 links to articles at wikipedia.org. Because the site was
written over the years, the link formats are not all consistent. For
example, a link to the article on Stanislaw Szukalski might have any of
these formats:

1. <a href="http://en.wikipedia.org/wiki/Stanislaw_Szukalski">Stanislaw Szukalski↗</a>
2. <a href="http://en.wikipedia.org/wiki/Stanislaw_Szukalski">Stanislaw_Szukalski↗</a>
3. <a href="http://en.wikipedia.org/wiki/Stanislaw_Szukalski">http://en.wikipedia.org/wiki/Stanislaw_Szukalski</a>

In a browser, they would be shown like this:

1. Stanislaw Szukalski↗
2. Stanislaw_Szukalski↗
3. http://en.wikipedia.org/wiki/Stanislaw_Szukalski

I want formats 2 and 3 to be replaced with format 1, but not always.
For some pages i want to use the full URL as the link text.

For simplicity, let's say i just want to replace format 2 with
format 1 on a case-by-case basis.

Here, a regex cannot do the job by itself, because i need each
underscore char “_” in the matched text to be replaced by a space “ ”.
This means i need to write a function that takes the matched text and
returns the desired text.

In emacs 22, there's a new feature that allows you to put an elisp
expression in your replacement string. This is done by giving the
replacement string the form “\,(fun-name)”, where fun-name is your
elisp function.

The task here is to write the replacement function.

Let's say our function will be named ff. ff takes no arguments: it
gets the text of the current match, replaces each “_” with “ ”, then
returns the new text.

The function skeleton would then look like this:

(defun ff ()
  "temp function. Returns a string based on current regex match."
  ;; 1. get the matched text
  ;; 2. transform the matched text
  ;; 3. return the transformed text
  )

This is conceptually simple. The hard part is knowing how elisp gets
the matched text in the emacs environment, and how to do text
replacement on a string in emacs lisp. Here's the solution:

(defun ff ()
  "temp function. Returns a string based on current regex match."
  (let (matchedText newText)
    ;; Text of the whole current match.
    (setq matchedText
          (buffer-substring
           (match-beginning 0) (match-end 0)))
    ;; Same text with each underscore replaced by a space.
    (setq newText
          (replace-regexp-in-string "_" " " matchedText))
    newText))

The (match-beginning 0) and (match-end 0) give you the beginning and
end positions of the whole current match (the argument 0 means the
entire match, not the first parenthesized group). The buffer-substring
function takes 2 positions and returns the actual text between them.
The replace-regexp-in-string function is used to transform the text.
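
To see what these match functions return, here is a small sketch that
can be evaluated outside of any find-replace session. The function name
my-demo-match-parts and the sample link text are made up for
illustration, and the character class is spelled out as A-Za-z. Note
that in a Lisp string each regex backslash must be doubled.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-match-parts ()
  "Search made-up sample link text and return a list of three strings."
  (with-temp-buffer
    (insert "<a href=\"x\">Stanislaw_Szukalski↗</a>")
    (goto-char (point-min))
    (when (re-search-forward ">\\([_A-Za-z0-9]+\\)↗</a>" nil t)
      (list
       ;; whole match, taken by buffer positions
       (buffer-substring (match-beginning 0) (match-end 0))
       ;; whole match again, via the shorthand match-string
       (match-string 0)
       ;; first parenthesized group only
       (match-string 1)))))
```

The first two elements should both be the whole match, including the
leading “>” and the trailing “↗</a>”. The third is just the link text
captured by the group.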

So, with this function written, we can use emacs's Interactively Find
and Replace String Patterns on Multiple Files feature along with our
function ff. The regex pattern to use would be:

>\([_A-Za-z0-9]+\)↗</a>

And the replacement expression would be:

\,(ff)
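
If no case-by-case confirmation is needed, roughly the same
replacement can be done from lisp code. The following is only a sketch
under that assumption, not part of the interactive method above. The
function name my-underscores-to-spaces-in-links is made up, and the
character class is spelled out as A-Za-z.

```elisp
;; Hypothetical non-interactive variant, for illustration only.
(defun my-underscores-to-spaces-in-links ()
  "Replace underscores with spaces in each matching link text.
Works on the current buffer. Returns the number of matches processed."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward ">\\([_A-Za-z0-9]+\\)↗</a>" nil t)
        ;; Same transformation as ff, applied to the whole match.
        (let ((newText
               (replace-regexp-in-string "_" " " (match-string 0) t t)))
          ;; FIXEDCASE and LITERAL are both t, so the new text is
          ;; inserted exactly as computed.
          (replace-match newText t t)
          (setq count (1+ count)))))
    count))
```

To try it, evaluate the defun, switch to a buffer holding a copy of a
HTML file, and run M-: (my-underscores-to-spaces-in-links).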

A function like ff can be of general use, whenever you need to do a
regex replacement where some transformation must be done on the
matched text.

Emacs is beautiful.

Reference: Elisp Manual: Buffer-Contents.

Reference: Elisp Manual: Search-and-Replace.

  Xah
  ···@xahlee.org
∑ http://xahlee.org/