Small regular expression parser

From: Jeff
Subject: Small regular expression parser
Date: Fri, 23 Jul 2004 01:47:26 +0000
Message-ID: <O6_Lc.7323$eM2.2855@attbi_s51>

While nothing special, I've been working on this for a project of mine
that needs regular expressions. And I didn't want something the size of
CL-PPLRE. So, I've developed my own. I will be putting it online for
those who are interested, but thought I would post here first and guage
interest before going through the hassle. It was developed in Corman
Lisp, but I'm sure it is completely Common Lisp and should compile it
other environments (I hope).

First, the goal was to develop a very simple regular expression parser.
It needed to be reasonably fast, small and handle 95% of all cases that
I would need it for. I decided to follow the example of Lua patterns to
make my life easier. So...

Using a token reader for #/ ... / generates a new regular expression.
Likewise, the function MAKE-REGEX can be used. Pre-defined character
sets are selected using the % character (like Lua) instead of \. There
are inclusive and exclusive sets. Capturing (and nested capturing) as
well as some simple functions to help with basic usage. The entire
package weighs in at ~400 lines (with ~80 of that being comments).
Definitely a testament to Lisp ;)

Some examples of what can be done with it:

(match-regex "Hello, World!" #/%U+/) ==> 1 6

MATCH-REGEX is the heart of all other functions. It returns the index
into the string of the start of the match and the length of the match.
In the above example, searching for 1 or more non-uppercase characters.

(capture-regex "my name is Jeff" #/%u%a+$/) ==> "Jeff"

CAPTURE-REGEX returns the first text string captured by the regular
expression. CAPTURE-ALL-REGEX is the same, but will return a list of
all subsequent matches in the search string.

(capture-all-regex "This is a test" #/%a+/) ==> ("This" "is" "a" "test")

SPLIT-REGEX will split a string into a list of strings that are
delimited by the regular expression:

(split-regex "1,2;3 4-5" #/[,;%s-]+/) ==> ("1" "2" "3" "4" "5")

REPLACE-REGEX and REPLACE-ALL-REGEX can be used to return a new string
with text from the original string that was matched replaced with new
text. There is currently no (built-in) support for accessing the
original text. So (for example), captitalizing every word would take
some extra code, but is possible.

(REPLACE-ALL-REGEX "This Is A Test" #/%u/ "*") ==> "*his *s * *est"

Last there are a couple macros to help with captures. Captures can be
nested, and are returned from MATCH-REGEX as values after the length of
the matched text. The captures are in the order in which they were
closed. The two macros are WITH-MATCH-REGEX and WITH-MATCH-ALL-REGEX.
So, for example:

(with-match-all-regex "Hello, world!" #/(%a+)/ (text)
   (print text)) ==> "Hello" "world"

(with-match-regex "Jeff Massung" #/(%a+)%s((%u)%a+)/
   (first initial last)
   (list initial last first)) ==> ("M" "Massung" "Jeff")

A quick breakdown of how the parser works (for those interested); the
expression string is parsed and broken down into elements (character,
set, eof, etc.). This list of elements is then compiled into a #<Regex>
class which is nothing more than a list of generated functions. Each
function takes an input stream (or match state) as a single argument
and returns either nil or the number of characters successfully
matched. That's it. I'm sure it isn't the fastest method, but was
definitely easier to write and is much easier to maintain ;)

I'd appreciate any comments. This is the beginning of a rather large
project using Lisp, and this was a bit of a journey of learning. Quite
valuable, too. Thanks go to those who have answered my questions here
in the past.

Jeff

From: ·········@random-state.net
Subject: Re: Small regular expression parser
Date: Fri, 23 Jul 2004 11:08:51 +0000
Message-ID: <cdqrk3$3p907$1@midnight.cs.hut.fi>

Jeff <···@nospam.insightbb.com> wrote:

The following comments are shooting from the hip -- just quick thoughts
while reading this.

> Using a token reader for #/ ... / generates a new regular expression.
> Likewise, the function MAKE-REGEX can be used. Pre-defined character

This (having both) is very nice. The only thing I'd also look for the
the more programmatic contruction of regexes:

 (make-regex (list '(:start) string1 pattern string2 '(:end)))

or equivalent.

> sets are selected using the % character (like Lua) instead of \. There

% sounds better then \ (esp. the escaping nightmares it enganders), but
I would have chosen either ~ (like FORMAT) or # (like dispatching
reader-macros) -- less tax on the brain.

Sounds very neat!

Cheers,

 -- Nikodemus                   "Not as clumsy or random as a C++ or Java. 
                             An elegant weapon for a more civilized time."