From: Howard R. Stearns
Subject: parsing/regexp/match/search
Date: 
Message-ID: <360A9A24.AE1BD6A5@elwood.com>
There are frequently posts to this newsgroup requesting parsers, regular
expression matchers, and other string manipulation utilities.

Of course, CL has an extensive sequence and string library, and the READ
function can be customized in many ways.  Nonetheless, there are some
applications which could benefit from having something more.  

Some code for this purpose can be found at the CMU AI repository, 
 
ftp://ftp.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/parsing/0.html
including:
  A LALR(1) parser-generator for Lisp.
  Zebu parser/unparser
  Natural language parsing

In addition Henry Baker's "Pragmatic Parsing in Common Lisp" paper at
ftp://ftp.netcom.com/pub/hb/hbaker/Prag-Parse.html includes source code.

Other code has been posted here recently, as well.

I wonder if it is possible to form some consensus, based on actual
experience, that a particular paradigm is broadly applicable and
efficient enough to become the usual suggested starting point.  (Think
in terms of being almost close enough to the universal "right thing" as
to be considered for inclusion in ANSI CL, but leaving out all the
politics and religion associated with deciding whether it should
ACTUALLY be made part of the standard.  Think "thought experiment".)

I'm imagining something which would parse a sequence (not necessarily a
string) into some set of values based on some template.  The values
might be boundary indexes into the sequence, substrings, numbers,
interned symbols, etc.  (For example, the act of parsing might involve
collecting enough information about, say, numbers, that a number-parser
might as well create the number instead of having to go back over the
string and re-examine some subrange.  Baker gives examples.)
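As a minimal sketch of that single-pass idea (the name SCAN-INTEGER and
its interface are my invention for illustration, not taken from Baker's
code), the scanner accumulates the number's value as it walks the
sequence, so no subrange ever needs to be re-examined:

```lisp
;; Hypothetical single-pass number scanner, in the spirit of Baker's
;; paper: the integer's value is built up while scanning, and the
;; stopping index is returned as a second value.
(defun scan-integer (sequence &key (start 0) (end (length sequence)))
  "Return two values: the non-negative integer parsed from SEQUENCE
between START and END (or NIL if no digits were seen), and the index
at which scanning stopped."
  (let ((value 0)
        (index start))
    (loop while (and (< index end)
                     (digit-char-p (elt sequence index)))
          do (setf value (+ (* value 10)
                            (digit-char-p (elt sequence index))))
             (incf index))
    (values (if (> index start) value nil) index)))

;; (scan-integer "123abc") => 123, 3
;; (scan-integer "abc")    => NIL, 0
```

Because it takes a sequence and :START/:END keywords rather than
demanding a string, the same scanner works on lists or vectors of
characters.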

In addition, it may be necessary for the caller to indicate whether the
entire sequence must match the template, or whether a partial result can
be used instead, with the failure point returned as an additional value.
(Like :junk-allowed, etc.)
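PARSE-INTEGER already follows this convention in standard CL: with
:JUNK-ALLOWED true it returns the parsed value (or NIL) together with
the index at which parsing stopped, instead of signalling an error on
trailing junk.

```lisp
;; Standard CL behavior: :JUNK-ALLOWED turns "junk signals an error"
;; into "return the partial result plus the failure point".
(parse-integer "42 apples" :junk-allowed t)
;; => 42, 2   (second value is the index where parsing stopped)
(parse-integer "junk" :junk-allowed t)
;; => NIL, 0  (nothing parsed; index of the failure point)
```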

Some sample problems from CL itself, for measuring the applicability of
the tools, might include:

- the parsing of individual tokens done by READ, whereby numbers or
symbols are created.
- parse-namestring (for various syntaxes, including URLs)
- other pathname search/replace/merge operations, such as those used in
translate-pathname, wild-pathname-p, and the merging of directory
components.
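For concreteness, the standard pathname operations above can be
exercised portably like this.  (Whether "*" in a namestring parses to
:WILD is implementation-defined, so the wild pathname is built
explicitly with MAKE-PATHNAME.)

```lisp
;; Standard CL pathname matching; the wild pathname is constructed
;; component-by-component because namestring wildcard parsing is
;; implementation-defined.
(let ((wild (make-pathname :name :wild :type "lisp")))
  (list (wild-pathname-p wild)                  ; true: has a wild component
        (pathname-match-p #p"foo.lisp" wild)))  ; true: FOO matches :WILD
```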

Thus the question I pose is twofold:

 1. What is a practical and achievable (but not necessarily
comprehensive/universal) set of criteria for evaluating such tools?

 2. Which tools measure up well against these criteria?

(I, personally, am not interested in hearing that this falls in the
category of the "universal code walker", and that there is no "one true
solution".  Such opinions will be evident from lack of responses.
Instead, I'm more interested in positive suggestions, and specific
EXPERIENCE-BASED criticisms of those suggestions.)

I admit that parts of some of the current "paradigms/patterns" books
might address this.  If so, feel free to summarize the relevant material
here.