A Proposed Specification for Lisp REGEXP

From: William Deakin
Subject: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <3804A85B.CF9874F1@pindar.com>

Dear All,

I have sat down and looked at what I would like to see in a REGEXP
package for cl. This is an early draft document and comments are
welcome:

1. Project: Regular Expression Pattern Matching and Bindings for Common
LISP

This document describes a proposed description of a regular expression
pattern matching and binding package for Common LISP.

2. Overview:

Following on discussions on the c.l.l. newsgroup a number of ways of
developing a cross-platform and compiler implementation
regular-expression package. This document is an attempt to describe what
would be wanted in this package and to open up a discussion about what
is felt to be an appropriate Lisp syntax. The motivation for this
document is to also try and sort out what would be minimum requirements
for such a package and to give some kind of focus for development of an
efficient implementation (particularly that this is something far beyond
the authors knowledge or expertise).

It is often stated that perl REGEX pattern-matching is superior to those
of cl. However, it is not the intentions of this document to replicate
the string matching capabilities of perl in cl, that is "to out-perl,
perl," but to provided a more general matching and variable-binding
package, that is "to out-lisp, perl."

3. Requirements:

3.1 Syntax:

The syntax will be based on UNIX sed/awk/perl syntax, as follows:

Characters ranges will be matched by:
    .            a single character;
    [p-rt]      range of characters p through r and t;
    [^p-rt]    all characters except p through r and t;

If following the above, the numeric modifiers:
    {n,m}    between n and m occurrences of the above characters;
    {n,}       at least n occurrences of the character(s);
    {,m}      no more that m occurrences of the character(s);
    *           equivalent to {0,};
    +          equivalent to {1,};
    will be allowed.

Position within a string or symbol-name will indicated as:
    ^           start;
    $           end;

In keeping with cl convention, a regular expression must be indicated by
macro-character. See 5.1.

3.2 Bindings:

The range of items to be bound will be indicated by a pair of
macro-characters, see 5.2. This is to mimic the (...) notation in sed or
perl.

Names of variables to be bound will be be passed as a list to the REGEXP
engine and if successful bindings made. This is envisaged as being
similar to the operation of the DESTRUCTURING-BIND macro.

3.3 Scope:

To be of most general use pattern matching will be able to be applied to
all list and sequence structures.

4. Assumptions

4.1 Implementations Restrictions:

There should be no implementation restrictions so as to stop extensions
of the regular-expression engine to allow passing in of functions or to
allow bindings to be made if, for example, a conditional statment is
part of the pattern matching.

5. Unresolved Issues:

5.1 Regular-Expression Macro-Character:

The regular-expression macro-character has to be decided upon. It is
suggested that #! is used.

5.2 Binding Varible Name Macro-Character:

The regular-expression macro-character has to be decided upon. It is
suggested that #< #> is used.

5.4 Replacement:

Another important area of pattern-matching is replacement of matched
items. This has not been discussed but will be addressed in any later
incarnation of this document.

Re: A Proposed Specification for Lisp REGEXP Raymond Toy
- Re: A Proposed Specification for Lisp REGEXP William Deakin
  - Re: A Proposed Specification for Lisp REGEXP Valeriy E. Ushakov
    - Re: A Proposed Specification for Lisp REGEXP William Deakin
Re: A Proposed Specification for Lisp REGEXP Barry Margolin
- Re: A Proposed Specification for Lisp REGEXP Marco Antoniotti
  - Re: A Proposed Specification for Lisp REGEXP William Deakin
- Re: A Proposed Specification for Lisp REGEXP William Deakin
  - Re: A Proposed Specification for Lisp REGEXP Tom Breton
    - Re: A Proposed Specification for Lisp REGEXP William Deakin
  - Re: A Proposed Specification for Lisp REGEXP Barry Margolin
    - Re: A Proposed Specification for Lisp REGEXP William Deakin
      - Re: A Proposed Specification for Lisp REGEXP Stig Hemmer
        Re: A Proposed Specification for Lisp REGEXP William Deakin
        Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Howard R. Stearns
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Paul Tarvydas
        Re: Update to A Proposed Specification for Lisp REGEXP Marco Antoniotti
        Re: Update to A Proposed Specification for Lisp REGEXP Paul Tarvydas
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Pierre R. Mai
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Larry Hunter
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Larry Hunter
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Stig Hemmer
        Re: Update to A Proposed Specification for Lisp REGEXP Chris Riesbeck
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Chris Riesbeck
        Re: Update to A Proposed Specification for Lisp REGEXP Lieven Marchand
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP Pierre R. Mai
        Re: Update to A Proposed Specification for Lisp REGEXP Tim Bradshaw
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP William Deakin
        Re: Update to A Proposed Specification for Lisp REGEXP ········@my-deja.com
        Re: A Proposed Specification for Lisp REGEXP Donald Fisk
        Re: A Proposed Specification for Lisp REGEXP Marco Antoniotti
        Re: A Proposed Specification for Lisp REGEXP William Deakin
Re: A Proposed Specification for Lisp REGEXP Eric Scott
- Re: A Proposed Specification for Lisp REGEXP Rob Warnock
Re: A Proposed Specification for Lisp REGEXP Ray Blaak
- Re: A Proposed Specification for Lisp REGEXP William Deakin
  - Re: A Proposed Specification for Lisp REGEXP Christopher R. Barry
    - Re: A Proposed Specification for Lisp REGEXP Quentin Deakin
    - Re: A Proposed Specification for Lisp REGEXP Tim Bradshaw

From: Raymond Toy
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <4nyad7p6ub.fsf@rtp.ericsson.se>

>>>>> "William" == William Deakin <·····@pindar.com> writes:


    William> 1. Project: Regular Expression Pattern Matching and Bindings for Common
    William> LISP

Have you checked out the regexp package from Sudhir Shenoy?  (I think
you can find it on the MCL archives on digitool.com) It seems to all
of your requirements except for the macro characters and bindings.

Ray

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <38058B6A.A2B715AB@pindar.com>

Raymond Toy wrote:

> Have you checked out the regexp package from Sudhir Shenoy?  (I think you can find it
> on the MCL archives on digitool.com) It seems to all of your requirements except for
> the macro characters and bindings.

Thank you for the pointer. I downloaded this last night but I cannot decode it :(

Do you know of any decoders for .hqx format files?

Best Regards,

:) will

From: Valeriy E. Ushakov
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <7u4tq6$kd6$1@news.ptc.spbu.ru>

William Deakin <·····@pindar.com> wrote:

> > Have you checked out the regexp package from Sudhir Shenoy?  (I
> > think you can find it on the MCL archives on digitool.com) It
> > seems to all of your requirements except for the macro characters
> > and bindings.

> Thank you for the pointer. I downloaded this last night but I cannot
> decode it :(

> Do you know of any decoders for .hqx format files?

Some time ago I asked Sudhir if his code could be downloaded in some
other format and he put the package on the web (not linked from his
homepage, though):

| I have placed the two files (source and a test driver) at
|         http://www2.gol.com/users/sshenoy/regexp-1.1.lisp
| and     http://www2.gol.com/users/sshenoy/regexp-test.lisp
| 
| There are bugs in it that I am aware of but have been too lazy to
| fix.  The bugs are identification of groups which include an 'or'
| operator.  However, the entire match is correct.  For example,
| (a|b)*c when applied to "aaabbbbc" will return the group as two
| characters when it should be the whole string (the match will
| correctly return the whole string).  This has to do with the group
| matching variable being a global and not behaving properly during
| backtracking.
| 
| If the above bug doesn't bother you, it should be fine.  (If you can
| provide a fix, I would be most grateful since I can't find the time
| to do it myself).

SY, Uwe
-- 
···@ptc.spbu.ru                         |       Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/            |       Ist zu Grunde gehen

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <3805FB74.E9098D49@pindar.com>

"Valeriy E. Ushakov" wrote:

> | I have placed the two files (source and a test driver) at
> |         http://www2.gol.com/users/sshenoy/regexp-1.1.lisp
> | and     http://www2.gol.com/users/sshenoy/regexp-test.lisp

Thank you very much :) will

From: Barry Margolin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <VS3N3.819$854.32624@burlma1-snr2>

In article <·················@pindar.com>,
William Deakin  <········@pindar.com> wrote:
>The regular-expression macro-character has to be decided upon. It is
>suggested that #< #> is used.

I suggest #/.../, since most applications that support regular expressions
use /.../ as the default syntax to specify them.

Also, #<...> is already in use in Common Lisp, and is required to signal an
error.  It's used in the printed rep of unreadable objects,
e.g. #<HASH-TABLE ...>.

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: Marco Antoniotti
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <lwbta2xves.fsf@copernico.parades.rm.cnr.it>

Barry Margolin <······@bbnplanet.com> writes:

> In article <·················@pindar.com>,
> William Deakin  <········@pindar.com> wrote:
> >The regular-expression macro-character has to be decided upon. It is
> >suggested that #< #> is used.
> 
> I suggest #/.../, since most applications that support regular expressions
> use /.../ as the default syntax to specify them.

I second that.  However, I must stress the point that all the
operators should be available with "standard" (Common) Lisp syntax.
Olin Shriver's SRE is already pretty good, IMHO.

> Also, #<...> is already in use in Common Lisp, and is required to signal an
> error.  It's used in the printed rep of unreadable objects,
> e.g. #<HASH-TABLE ...>.

Yep. #<...> is a no-no.

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <38058D99.ED54B089@pindar.com>

Marco Antoniotti wrote:

> Olin Shriver's SRE is already pretty good, IMHO.

I must apologise to you and Olin Shriver's but in my incompetence (I have been on
holiday thinking about this, and was in a large shed somewhere near Penrith, with
no access to a computer) I have not had an opportunity to explore SRE.

With you recommendation I will look at this syntax and update the specification
accordingly,

Best Regards,

:) will

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <38058CE0.69AFEA91@pindar.com>

Barry Margolin wrote:

> In article <·················@pindar.com>,
> William Deakin  <········@pindar.com> wrote:
> >The regular-expression macro-character has to be decided upon. It is
> >suggested that #< #> is used.
>
> I suggest #/.../, since most applications that support regular expressions
> use /.../ as the default syntax to specify them.
>
> Also, #<...> is already in use in Common Lisp, and is required to signal an
> error.  It's used in the printed rep of unreadable objects,
> e.g. #<HASH-TABLE ...>.

Thank you are right, I cannot use #< #> :( Hmmm.

However, yhe idea of the #< #> is not to represent the regular expression, but to
delimit the the part of the regular expression to be bound.

For example: to access the binding for the seconf field separated by : you write
/^.*:(.*):/ the bit in the (.*) is then bound to a \1 or $1 character. It is this
that I wanted to use some form of (). But #( #) is already taken (of course).

Are there any obvious paired operators "#[ #]" for example that could be used.

Best Regards,

:) will

From: Tom Breton
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <m3zoxlbwbk.fsf@world.std.com>

William Deakin <·····@pindar.com> writes:

> Barry Margolin wrote:
> 
> > In article <·················@pindar.com>,
> > William Deakin  <········@pindar.com> wrote:
> > >The regular-expression macro-character has to be decided upon. It is
> > >suggested that #< #> is used.
> >
> > I suggest #/.../, since most applications that support regular expressions
[]
> >
> > Also, #<...> is already in use in Common Lisp, and is required to signal an
[]
> 
> Thank you are right, I cannot use #< #> :( Hmmm.
> 

> that I wanted to use some form of (). But #( #) is already taken (of course).
> 
> Are there any obvious paired operators "#[ #]" for example that could be used.

Since you already know about SRE, IMO you should read it before (and
instead of) looking for reader macro character to use.

-- 
Tom Breton, http://world.std.com/~tob
Not using "gh" since 1997. http://world.std.com/~tob/ugh-free.html

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Fri, 15 Oct 1999 00:00:00 +0000
Message-ID: <3806ED03.994DDB70@pindar.com>

Tom Breton wrote:

> IMO you should read it before (and instead of) looking for reader macro character
> to use.

I completely agree with you. This is a dead end. Sorry.

:) will

From: Barry Margolin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <%pmN3.853$854.35050@burlma1-snr2>

In article <·················@pindar.com>,
William Deakin  <········@pindar.com> wrote:
>Barry Margolin wrote:
>
>> In article <·················@pindar.com>,
>> William Deakin  <········@pindar.com> wrote:
>> >The regular-expression macro-character has to be decided upon. It is
>> >suggested that #< #> is used.
>>
>> I suggest #/.../, since most applications that support regular expressions
>> use /.../ as the default syntax to specify them.
>>
>> Also, #<...> is already in use in Common Lisp, and is required to signal an
>> error.  It's used in the printed rep of unreadable objects,
>> e.g. #<HASH-TABLE ...>.
>
>Thank you are right, I cannot use #< #> :( Hmmm.
>
>However, yhe idea of the #< #> is not to represent the regular expression, but to
>delimit the the part of the regular expression to be bound.

I was responding to the wrong paragraph.  #/ should replace what you
proposed to use #! for.

>For example: to access the binding for the seconf field separated by : you write
>/^.*:(.*):/ the bit in the (.*) is then bound to a \1 or $1 character. It is this
>that I wanted to use some form of (). But #( #) is already taken (of course).

Why do you need # in that context at all?  Within the regular expression
you should be able to use traditional regexp syntax.  You could either do
it like Perl and egrep, in which () are special characters, or you could do
it like Emacs Lisp, which uses \ to flag various extended regexp character,
so it would be /^.*:\(.*\):/.

# is only needed to switch from normal Lisp parsing into regexp parsing,
which should gobble up the entire regexp.

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <3805F6BB.D63A9087@pindar.com>

Barry Margolin wrote:

> # is only needed to switch from normal Lisp parsing into regexp parsing, which
> should gobble up the entire regexp.

You are perfectly correct. I now have had an oportunity to look at the much more
elegant syntax provided by SRE, as implemented by Olin Shivers. This is so much
smarter than what I proposed and will form the basis of what I would like to do.

Whether this is a port from scheme to cl or a rewrite using the same syntax I don't
know yet ;)

What I would like to extend, however, is the way in which  variables are bound. There
is a bit in Olin's excellent specification where he talks about the difficulties of
binding within regular expressions. I agree, but only to a degree. I think this is an
area where Lisp Can Win Big. And maybe more of an issue with scheme than cl. But don't
bother flaming me, you scheme'rs out there, I don't know this and am speculating
widly.

On the basis of SRE understanding I will update the spec and sort something out for
early next week. If anybody is still interested ;)

Best Regards,

:) will

From: Stig Hemmer
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Fri, 15 Oct 1999 00:00:00 +0000
Message-ID: <ekvn1tkmvq7.fsf@jr.pvv.ntnu.no>

William Deakin <·····@pindar.com> writes:
> What I would like to extend, however, is the way in which variables
> are bound. There is a bit in Olin's excellent specification where he
> talks about the difficulties of binding within regular
> expressions. I agree, but only to a degree. I think this is an area
> where Lisp Can Win Big.

You might want to look at Paul Grahams book "On Lisp" where he
addresses this in the context of embedding Prolog in Lisp.

He does several variants, but most of them creates new macros that
expand into LET forms with a supplied body.

This seems to me like the cleanest way of doing things.

Stig Hemmer,
Jack of a Few Trades.

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Mon, 18 Oct 1999 00:00:00 +0000
Message-ID: <380ADD1A.87CA060D@pindar.com>

I've been thinking about this:

> For example: a pattern could be specified that would pick out (a (b . c)) (a
> (d (b .c))) (a (d (e (b . c)))) but not (a (d b)  . c).

This is me being dense again. Following on from a mail from Paul Foley, what I
have describe here is more along the lines of a non-deterministic pattern
matching engine :(

This is NOT what I want, otherwise I would be looking at something like a prolog
engine and not a pattern matching one.

Best Regards,

:) will

From: William Deakin
Subject: Update to A Proposed Specification for Lisp REGEXP
Date: Fri, 22 Oct 1999 00:00:00 +0000
Message-ID: <38108A86.32B598B6@pindar.com>

Dear All,

I have sat down (again) and revised the thoughts I had on what I would like to see
in a REGEXP/pattern-matching system for cl. Please note: I have broken the document
into the high level overview and a list of implementations that I am looking at, and
have not included the other stuff. The other stuff is still being worked on and will
be posted to a newsgroup near you as soon as I have finished.

I am still very interested in looking at any other implementations, so if I have
missed some please let me know. Thank you for all your comments on the first draft.
Any further comments would be most welcome.

Best Regards,

:) will

1. Project: Regular Expression Pattern Matching and Bindings for Common LISP

This document describes a proposed system for pattern matching, substitution and
value binding for Common LISP. This will specify what is required of an effective
pattern matching package encompassing existing LISP regular expression matching
implementations; and to be the first step in developing a powerful, flexible and
practical pattern matching tool. It is hoped that this will focus attention,
stimulate discussion and, through dialogue and iteration, reaching consensus on the
syntax and scope of such a system.

It has be stated that perl REGEX pattern matching is superior to that of Common
LISP. This has been used (in part) to justify the removal of Common LISP whilst
retaining of perl from where I work. However, it is not simply the intentions of
this document to replicate the string munging of perl, "to out-perl, perl," but to
provided a more general matching and variable binding package, that is "to out-lisp,
perl."

2. Overview:

Following on from discussions on the comp.lang.lisp newsgroup, the author, in
discussion with others, has started to look at a system for pattern matching,
substitution and value binding. A two-fold approach has been taken: current
implementation are being assessed; and what is required of a matching system is
being considered in discussion with any interested parties. The author can be
contacted at ········@pindar.com, ···········@tesco.net, or posting to
comp.lang.lisp.

This matching pattern matching system will encompass regular expression string
matching, as found in perl/sed/awk. But, due to the much richer selection of data
types found in Common LISP, this will be extended to encompass character, numbers
and sequence data types including vectors, strings and lists.

This document provides an initial attempt at looking of current regular expression
implementations, out-lines appropriate LISP matching syntax and describes what is
wanted in a matching system. Please note, this document is not complete, and the
matching syntax and matching system description will be posted at a later date.

The ultimate goal is to see efficient cross platform and compiler implementations.
It is hoped that this document may provide some kind of focus for development an
efficient implementation (particularly since this is something far beyond the
authors current skill).

3. Current Implementations

I have seen source code for the following regular expression or matching
implementations:

select-match by Stephen Adam;
match by Paul Foley;
nregexp by Lawrence E. Freil;
match by Paul Graham;
matcher by Mark Kantrowitz;
pat-match by Peter Norvig;
regexp-1.1 by Sudhir Shenoy;
SRE by Olin Shivers;

These implementaions are ordered by author name.

Following an, albeit rather hasty, evaluation it is my (and only my rather naive)
opinion that these implementations provide a solid foundation for the development of
a matching package. Unfortunately, they do not provide the full implementation
require. Please note, as this is not what the authors of these packages intended,
this is hardly suprising, and is not intended as a criticism of this work.

From: Howard R. Stearns
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Fri, 22 Oct 1999 00:00:00 +0000
Message-ID: <3810DD72.7EDF1AA3@elwood.com>

William Deakin wrote:
> ...
> This matching pattern matching system will encompass regular expression string
> matching, as found in perl/sed/awk. But, due to the much richer selection of data
> types found in Common LISP, this will be extended to encompass character, numbers
> and sequence data types including vectors, strings and lists.

Be careful.  You might enlarge the scope so much that it becomes
non-trivial to achieve efficient performance on the Perl-like things at
the core of your original issue.

Having said that, I welcome the idea of having a "grand unified regexp
language" that works on elements of a sequence, be it characters in a
string, symbols in a list, etc.  By my way of thinking, the only reason
for not writing my own matcher each time I need one, is to to use a
single common matcher for everything.  If it doesn't do everything I'll
ever need and I have to roll my own once, then I might as well roll my
own all the time.  (That's a little more extreme than what I actually
feel, but you get the idea.)

If you are willing to work at it, there is no reason that you can't make
a matcher that applies a regexp definition to a sequence within some
context such that the code automatically uses efficient access peculiar
to the type of sequence.

My favorite pain-in-rear example of matching/substitution/binding is
pathnames.  Consider, for example, PATHNAME-MATCH-P and
TRANSLATE-PATHNAME on tree structured directories with extended
wildcards such as :wild-inferiors and regexp matching on components. 
(Pathnames have components, including directories, which are themselves
lists of components, with components being either strings with their own
wildcard syntax, or symbols donoting various special things. CLtL2 has
some nice examples.)  Some of this work gets duplicated for parsing
namestrings into pathnames, so it's nice if the syntax specification can
be applied to both matching and parsing.

How all-encompasing do you want to get?

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <38142829.286A3D9C@pindar.com>

"Howard R. Stearns" wrote:

> Be careful.  You might enlarge the scope so much that it becomes non-trivial to
> achieve efficient performance...

Yes. True.

> Having said that, I welcome the idea of having a "grand unified regexp language" that
> works on elements of a sequence, be it characters in a string, symbols in a list,
> etc.

Thank you. Good.

> If it doesn't do everything I'll ever need and I have to roll my own once, then I
> might as well roll my
> own all the time.

But, if you had a tool that has sufficient flexibility to allow you to write what you
want...

> How all-encompasing do you want to get?

I don't know, yet. But "through dialogue and iteration," I would like to  reach
"consensus on the syntax and scope of such a system." This is a politician answer but I
really don't know.

Best Regards,

:) will

From: Paul Tarvydas
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Sat, 23 Oct 1999 00:00:00 +0000
Message-ID: <3811E788.57BBBC01@tscontrols.com>

William,

I have a definite interest in the issue of REGEXP's for Lisp.  Here's
some random comments:

1) The Icon/SNOBOL people know the most about pattern matching.  It
   might be worth a look at Icon to see if it generates ideas on how
   to lispify regexps.

2) If you want to do better than perl, you have to allow for
   convenient multi-line matches - this enables parsing.

3) I currently think that parsers (yacc, etc.) should be regexp-based.
   The idea of applying regexps to sequences of non-characters is, of
   course, vital.  In my opinion, regexps need at least these extra
   concepts to enable parsing:

    a) the ability to specify "noise" tokens - items in a sequence
       that will be automatically skipped over during a match
       (e.g. whitespace, comments, etc)

    b) the ability to specify simple state machines

    c) the ability to invoke other regexps from within a regexp (not
       exactly a stretch in Lisp-ish thinking :-).



I've been experimenting with exactly these concepts.  At the risk of
over-lengthifying this note with information that may not apply to
your goals, here's a brief description of what I've got:

The primitive sequence element is a

    (defstruct token kind data).

The :kind field is an arbitrary symbol.  The :data field contains
arbitrary information related to the token (for example the actual
character value of a 'char token).

Matching functions return these multiple values:

  - a success flag (t or nil, remarkably)

  - a sequence representing the matched portion of the input

  - a sequence representing the remaining portion of the input
    following the match

  - an assoc list of bindings (name vs. matched sequence)

Matching functions take these parameters as input:

  - a pattern (a lisp list)

  - the input list of tokens

  - a list of token kinds to be skipped during matching
    (skipping means "advance match by one element, do not
    advance pattern")

  - a match accumulator

  - an assoc list of labels vs. pattern-bits (implements state
    matching)

  - an assoc list of the current set of bindings.

Matching functions are combined functionally - a successful match
invokes the next match in the sequence with arguments appropriately
advanced: the pattern has been "cdr'ed", the input has been advanced,
the match accumulator has had the most recent match appended to it and
new bindings have been cons'ed onto the binding list.

The primitive matching operations on sequences of tokens are:

(tok <kind> <value>) - match a single token exactly or fail

(kind <k>)      - match a single token of kind "k"

(seq (...))      - match a sequence of patterns, i.e. do not allow
                       intervening skipped tokens (for example, if you
                       are matching a legal "C" identifier, you don't
                       allow whitespace skipping (where skipping means
                       "added to the match"))

(any (...))      - match any one of the given patterns (Icon-esque)

(many (...))         - regexp "+" - match one or more of the given
         patterns (Icon-esque)

(many-optional (...))- regexp "*" - match zero or more

(func <id>)          - recursively invoke pattern named "id" (in Lisp,
         I've implemented "named" patterns as functions)

(upto (...))      - match until given pattern succeeds,
         non-inclusive (Icon-esque)

(upto-i (...))      - same as upto, but inclusive

(labels ((L1 (...)) (L2 (...)) ...))

              - This contains a state-based match pattern.

       - Continue matching within an environment
         containing the given named patterns (L1, L2
         ...).  Matching commences with the first named
         pattern (in this case L1).

(choice ((...p1...) T1) ((...p2...) T2) ... ((*) Tn))

       - Continuation of state-based matching.

       - Like "cond" for matching - a set of
                       pattern-label pairs; select first successful
                       match, then continue matching at the pattern
                       named by the label.  For example, if p2
                       matches, then matching will continue with the
                       pattern named "T2" (selected from the
                       environment created by a "labels" operation).
                       The "*" pattern always matches and acts as an
                       "otherwise" clause for the choice.

(?= <name> (...))    - binding: if the pattern (...) matches, then
         bind the matched sequence to "name"

(string "xyx")      - same as (seq (tok char #\x) (tok char #\y) (tok
         char #\z)

(bal (...left-p...) (...right-p...))

       - match a pattern bounded by left-p on the left
                       and right-p on the right, inclusively and
                       recursively.


EXAMPLES:

(in my world, all text is converted into tokens of kind "char")

A C++ comment is:

   ((seq (tok char #\/) (tok char #\/)) (upto (tok char #\Newline)))

or

   ((string "//") (upto (tok char #\Newline)))

i.e. two slashes with no intervening whitespace followed by anything
except a newline; the newline is not consumed and remains at the head
of the input stream or \/\/.*$ in regexp-ese.


A run of whitespace is:

   ((many (tok char #\Space) (tok char #\Tab)))

i.e. one or more spaces or tabs. [ \t]+ in regexp-ese.


A keyword (in this case "typedef") can be matched with:

   (string "typedef").


A C compound statement could be loosely matched by:

   (bal (string "{") (string "}")).

[I don't believe that this can be done in regexp-ese.]

A C string is:

   ((labels
      (start  (tok char #\") (choice ((tok char #\") string-end)
                                     ((tok char #\\) escape)
                                     ((*) string-chars)))
      (string-chars  (kind char) (choice ((tok char #\") string-end)
                                         ((tok char #\\) escape)
                                         ((*) string-chars)))
      (string-end  (tok char #\"))
      (escape  (tok char #\\) (choice ((*) string-chars)))))

i.e. a state-based match.  It begins by matching a dquote (").  If it
immediately finds another dquote, it succeeds and consumes the
trailing dquote (in "string-end").  If it encounters a backslash, it
consumes an escaped character or octal/decimal number (in this
implementation, I don't care to know what the octal/decimal value is -
this lets me simplify the problem - the only escaped entity that
really matters is \", hence, the simplified state "escape").
Otherwise, the matching state becomes "string-chars" where any
character is consumed (matched) and the cycle repeats.  [I don't
believe that this can be done in regexp-ese.]



TREE REPLACEMENT

I use these matches inside of macros that match, then replace whole
sequences.  For example, I filter all whitespace into whitespace
tokens with:

(defun replace-whitespace (lis)
  (replace-all
   ((many (tok char #\Space) (tok char #\Tab)))
   (make-token :kind 'whitespace :data ?matched)
   lis nil))

(the macro automagically binds the symbol ?matched to the full
match).

I hope that this has been helpful.

Paul Tarvyds
·········@tscontrols.com

From: Marco Antoniotti
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Sun, 24 Oct 1999 00:00:00 +0000
Message-ID: <lwg0z1gk9m.fsf@copernico.parades.rm.cnr.it>

Paul Tarvydas <·········@tscontrols.com> writes:

> William,
> 
> I have a definite interest in the issue of REGEXP's for Lisp.  Here's
> some random comments:
> 
> 1) The Icon/SNOBOL people know the most about pattern matching.  It
>    might be worth a look at Icon to see if it generates ideas on how
>    to lispify regexps.
> 
> 2) If you want to do better than perl, you have to allow for
>    convenient multi-line matches - this enables parsing.

No it does not, unless you restrict the meaning of the term 'parsing'
as it has been used in many years in the compiler and text processing
field.  (Of course, it may instead be the case, that my English were
too limited).

> 
> 3) I currently think that parsers (yacc, etc.) should be
     regexp-based.

Forgive my bluntness, but I believe a re-reading of the white book or
of the Dragon book is in order :)

As a matter of fact, what is really needed (and something which has
appeared in some of the most modern 'parsing languages' in the Java
world and also in Zebu) is to have a single language (grammars) to
specify Regular and (Extended) Context Free recognizers and
generators.  The "regexp" string based representation, common in the
UN*X world, should be an abbreviation.

I skip the rest of the message, which is interesting indeed.  Note
that your 'labels' construct (beside overloading a Common Lisp special
form: not a thing I like too much) is essentially a wrapper for a
'rule' definition form.

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa

From: Paul Tarvydas
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <3813D286.B9DBD4D3@tscontrols.com>

Marco Antoniotti wrote:

Thanks for your comments!

> >> 2) If you want to do better than perl, you have to allow for
> >>    convenient multi-line matches - this enables parsing.
>
> >No it does not, unless you restrict the meaning of the term 'parsing'
> >as it has been used in many years in the compiler and text processing
> >field.  (Of course, it may instead be the case, that my English were
> >too limited).
>

It is true.  The word "parsing" to me means "compiler-like" parsing(*).

What I was trying to say is that there is a gray area between lexical
"scanning" and compiler-like "parsing".  Most practitioners consider
lexical scanning to be "easy" and compiler-like parsing to be "difficult"
(i.e. too much trouble, requiring too much thought (!)).  Perl makes "easy"
lexical scanning more powerful and almost, but not quite, as powerful as
compiler-like parsing.

I argue that adding the features I suggested (skipping noise tokens,
multi-line matching and simple-state based matching) to lexical scanning
(i.e. lex, Perl and regexps) would fill much of the gray area.

Lisp combined with such pattern-matching would make an amazingly powerful
text-manipulation tool.

> >Forgive my bluntness, but I believe a re-reading of the white book or
> >of the Dragon book is in order :)

:-).  The last time (admittedly over a decade ago) that I consumed these
books, they made a distinction between lexical analysis and parsing.  I
haven't seen a commercially-usable tool that uses the *same* notation to
describe lexical analysis and parsing (some languages combine both, but
with a different notation for each).   [This might mean that I haven't
looked hard enough :-].

> >As a matter of fact, what is really needed (and something which has
> >appeared in some of the most modern 'parsing languages' in the Java
> >world and also in Zebu) is to have a single language (grammars) to
> >specify Regular and (Extended) Context Free recognizers and
> >generators.  The "regexp" string based representation, common in the
> >UN*X world, should be an abbreviation.

I agree.  This is what I want.  Can you give me some pointers?

The problem I'm working on involves languages with mixed syntaxes.  I'm
trying to translate graphical notations, containing code snippets, into
compilable text.  I don't want to, nor need to parse and analyze both
languages.  I want to peel off the graphical semantic information and use
it to generate code that wraps the code snippets into legal programs.  The
graphical syntax is automatically generated and therefore does not need to
be subjected to a full compiler-like semantic analysis and the embedded
code snippets are checked further downstream.  Just to make life
interesting, every now and then, I need to "reach into" a code snippet and
alter a bit of code.

To be concrete, the code snippets are written in C and C++.  The
graphically-emitted wrapper syntax is a nice LL(0) grammar  (Ugh!  It has
been a long time since I've read those books.  Whatever the correct
terminology is, what I mean is that the wrapper syntax can be easily parsed
in a top-down manner with no lookahead).

Perl and regexps can't do this job because they can't (easily) match
arbitrary sequences of tokens that span many lines.

Yacc (state-based parsing) and all of its derivatives are too "heavy
weight" - I would be required to write a grammar that accepted the union of
both languages.  Just writing a usable C++ grammar in yacc would be enough
to drive me mad, but mixing it with another syntax would turn the rest of
my hair gray.  Bleah.

In addition, I want to keep the original code snippets "intact", i.e. I
don't want to lose formatting, whitespace and newline information
(state-based parsers tend to destroy whitespace).

In fact, it's not clear that I'm working with "context free grammars".  I
probably want a context-dependent grammar :-).

> >I skip the rest of the message, which is interesting indeed.

If you found the rest of my messagte interesting but didn't like
introduction, then it's *my* English that's too limited :-).

> >Note that your 'labels' construct (beside overloading a Common Lisp
> special form: not a thing I like too much)

Yes, I agree.  The only thing that the name "labels" has going for it is
that it immediately communicated my architectural intention to another
lisp'er.  At the moment, the code is implemented as a cheap interpreter, so
it cannot conflict with Lisp, but if I refactored the code, I would
certainly have to get rid of the name clash.  It's the idea that counts,
anyway :-)...

> >is essentially a wrapper for a 'rule' definition form.

I don't understand - I've been away from Lisp for quite a while, mired in
8051 assembler programming (and 8051 graphical programming :-) for too
long.  Can you explain in more detail what you mean?

>

Thank you again
Paul Tarvydas

(*) Unfortunately as far as I can tell, the word "parse" has, at best, the
meaning I've used (compiler-like).  Like "visual programming" (which should
mean "programming using a graphical syntax", instead of "an IDE that lets
you build widget-based programs") the term "parse" has been given a
concrete (and very limiting) meaning by marketing departments of companies
that sell yacc-derivative tools.

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <38146A45.1EA54E36@pindar.com>

Paul Tarvydas wrote:

> I argue that adding the features I suggested ... to lexical scanning
> (i.e. lex, Perl and regexps) would fill much of the gray area.

Yup. I think this is true.

> Lisp combined with such pattern-matching would make an amazingly powerful
> text-manipulation tool.

And so is this lots.

> Just writing a usable C++ grammar in yacc would be enough to drive me mad,
> but mixing it with another syntax would turn the rest of my hair gray.
> Bleah.

Ouch. I have tried to do similar things. Ouch again.

> In fact, it's not clear that I'm working with "context free grammars".  I
> probably want a context-dependent grammar :-)

Hmmm. I not sure I can help with this one. But if the tools were there to do
the matching, you could then paste context-depence on top. Although from the
little I know (which isn't much) context-dependent grammars are rare.

Thank you for your comments.

Best Regards,

:) will

From: Pierre R. Mai
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <87aep7pg4g.fsf@orion.dent.isdn.cs.tu-berlin.de>

William Deakin <·····@pindar.com> writes:

> Hmmm. I not sure I can help with this one. But if the tools were there to do
> the matching, you could then paste context-depence on top. Although from the
> little I know (which isn't much) context-dependent grammars are rare.

Context-dependent grammars aren't really that rare in principle.  It's
just that most cool (i.e. fast and simple) parsing techniques and
corresponding parser generators only work for (subsets of)
context-free grammars, so that one tries really hard to get a
context-free grammar for the language you are interested in parsing,
even if that involves extensive rewriting (beyond the point of easy
readability), hacks like manual disambiguation or moving the
context-dependent parts into later stages, and/or even changing the
language to include additional keywords in just the right places.

Yet many languages are really described most concisely and
understandably using context-dependend grammars.  This isn't even
really surprising, since programming languages are often trying to be
the most human-agreeable, while still remaining easy to process for
computers.  And humans are very good at handling context-dependencies,
whereas computers aren't traditionally.  So there is a tendency for
languages to push the boundaries of the usual parsing techniques.

Regs, Pierre.

-- 
Pierre Mai <····@acm.org>         PGP and GPG keys at your nearest Keyserver
  "One smaller motivation which, in part, stems from altruism is Microsoft-
   bashing." [Microsoft memo, see http://www.opensource.org/halloween1.html]

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <381565E0.2F2107C1@pindar.com>

Pierre R. Mai wrote:

> Will writes:
> > from the little I know (which isn't much) context-dependent grammars are rare.
>
> Context-dependent grammars aren't really that rare in principle.

Yes. True. It's just in my limited experience I have yet to come across many (any)
context-dependent grammars.

> It's just that most cool (i.e. fast and simple) parsing techniques...only work
> for (subsets of) context-free grammars,

Or at least parsing techniques that take context-dependant grammar and mung them
into context-free format. But this is what you are saying, so I will shu up.

> so that one tries really hard to get a context-free grammar...even if that
> involves extensive rewriting (beyond the point of easy readability), hacks like
> manual disambiguation...

What an amazing word :)

> ...or moving the context-dependent parts into later stages, and/or even changing
> the language to include additional keywords in just the right places.

Yup. Agreed.

> Yet many languages are really described most concisely and understandably using
> context-dependend grammars.

This is interesting. If I remember right, it was a while ago, a conclusion of a
thread on c.l.l. about natural languages was that most natural languages,
including English, Cymraeg &c, but maybe not German or Latin (or at least part of
these languages anyway) are context free. But I agree, this does not make these
grammars easy to understand.

> This isn't even really surprising, since programming languages are often trying
> to be the most human-agreeable, while still remaining easy to process for
> computers.  And humans are very good at handling context-dependencies, whereas
> computers aren't traditionally.

Are they? I know humans are good at handling context-dependant meaning, that is
what it, this and that are all about, but do they find context dependent grammar's
easy? I just say this because I don't know.

>  So there is a tendency for languages to push the boundaries of the usual
> parsing techniques.

Yes.

Best Regards,

:) will

From: Larry Hunter
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <3is3duyyzeh.fsf@work.nci.nih.gov>

William Deakin writes:

  > Context-dependent grammars aren't really that rare in principle.

  Yes. True. It's just in my limited experience I have yet to come across
  many (any) context-dependent grammars.

It's actually amazing how many fairly simple concepts require
context-sensitive (prefered name over "dependent") grammars.  Probably the
best known example is the "copy language", that is, any string followed by a
copy of itself.  Another example is "a string consisting of a number of 1's,
followed by an equal number of 2's, followed by an equal number of 3's."

My favorite example is biological: pseudonots in RNA require a
context-sensitive language to describe.  See David Searls' "The
Computational Linguistics of Biological Sequences," in my 1993 book
"Artificial Intelligence and Molecular Biology," available free on the net
at http://www.aaai.org:80/Press/Books/Hunter/hunter-contents.html

Larry

-- 
Lawrence Hunter, Ph.D.        Chief, Molecular Statistics and Bioinformatics
National Cancer Institute                             email: ·······@nih.gov
Federal Building, Room 3C06                         phone: +1 (301) 402-0389
7550 Wisconsin Ave.                                   fax: +1 (301) 480-0223
Bethesda, MD 20892  USA

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <3815C314.475959FD@pindar.com>

Larry Hunter wrote:

> It's actually amazing how many fairly simple concepts require
> context-sensitive...grammars. Probably the best known example is the "copy
> language", that is, any string followed by a copy of itself.

Hmmm. This is interesting but is the copy language actually a language? In my
simple way, I would say this more looks like a way of encoding data.

> Another example is "a string consisting of a number of 1's, followed by an
> equal number of 2's, followed by an equal number of 3's."

This is also interesting, as it is possible to parse the english sentence using
a context free grammar. However, the meaning of the sentence requires context.
But I probably have misunderstood you.

Thanks for the reference, I have had a quick skim read, and it looks very
interesting.

Best Regards,

:) will

From: Larry Hunter
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <3iszox6xapy.fsf@work.nci.nih.gov>

William Deakin said: 

  Larry Hunter wrote:

    > "It's actually amazing how many fairly simple concepts require
    > context-sensitive...grammars."

    Probably the best known example is the "copy language", that is, any
    string followed by a copy of itself.

  Hmmm. This is interesting but is the copy language actually a language? In my
  simple way, I would say this more looks like a way of encoding data.

Sure its a language.  The standard way to define it is to let S (usually
written as a capital sigma) be the characters in the language, and let * be
the Kleene star (meaning 0 or more occurances of the preceeding).  The copy
language is defined as

 L = { ww : w is an element of S*}

Quite simple, and definitely a language. For example, if S is [A-Z], then
AA, ABAB, ABCDABCD, EFGEFG, etc. are sentences in the copy language. 

           n
If we let X  mean n copies of the letter X, then we can write the other
language I mentioned as

        n n n
 L = { 1 2 3  : n > 0}

and 123, 112233, 111222333, etc. are sentences in the language.  Although
you may be right that the sentence I wrote describing the language might be
parsable by a complete context free grammar of English (as if there really
were such thing!), the language itself is most definitely not context free.

Larry

-- 
Lawrence Hunter, Ph.D.        Chief, Molecular Statistics and Bioinformatics
National Cancer Institute                             email: ·······@nih.gov
Federal Building, Room 3C06                         phone: +1 (301) 402-0389
7550 Wisconsin Ave.                                   fax: +1 (301) 480-0223
Bethesda, MD 20892  USA

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <3816B63F.E47C5820@pindar.com>

Larry Hunter wrote:

> William Deakin said:
>
>   Hmmm. This is interesting but is the copy language actually a language? In my
>   simple way, I would say this more looks like a way of encoding data.
>
> Sure its a language.

Yup, it is a language. But if I was to ask my sister (who is a fine-arts graduate)
she might not agree with you. That is all I was saying. Maybe I should have put a
:) in there.

Thanks for the description of languages, by the way. It was an interesting example
too, which in my hazy way I had not considered.

Best Regards,

:) will

From: Stig Hemmer
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <ekv3duy0zda.fsf@verden.pvv.ntnu.no>

William Deakin <·····@pindar.com> writes:
> Larry Hunter wrote:
> > It's actually amazing how many fairly simple concepts require
> > context-sensitive...grammars. Probably the best known example is the "copy
> > language", that is, any string followed by a copy of itself.
>
> Hmmm. This is interesting but is the copy language actually a language?

Be aware that mathematically inclined computer scientists(MICSes) use
the word "language" in a way that can be confusing to others.

Basically, a language is a subset of the set of the possible sequences
of characters chosen from some set.

In Hunters example "abcabc" is a member of the "copy language set",
whereas "abcacab" is not.

A "regular language" is the set of sequences matched by a regular
expression (excluding back references)

Stig Hemmer,
Jack of a Few Trades.

From: Chris Riesbeck
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <riesbeck-2610991257240001@riesbeck.ils.nwu.edu>

In article <·················@pindar.com>, ········@pindar.com wrote:

>Larry Hunter wrote:
>
>> It's actually amazing how many fairly simple concepts require
>> context-sensitive...grammars. Probably the best known example is the "copy
>> language", that is, any string followed by a copy of itself.
>
>Hmmm. This is interesting but is the copy language actually a language? In my
>simple way, I would say this more looks like a way of encoding data.
>
>> Another example is "a string consisting of a number of 1's, followed by an
>> equal number of 2's, followed by an equal number of 3's."
>
>This is also interesting, as it is possible to parse the english sentence using
>a context free grammar. However, the meaning of the sentence requires context.

Depends on what you want the parse tree to capture. The classic example
in English that is analagous to "copy language" is

  John, Mary, Fred, and Alice received a book, a medal, a pen,
  and a ribbon, respectively.

A CF grammar can parse this but the output won't make the right
cross-connections.

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <3816B550.E5459A29@pindar.com>

Chris Riesbeck wrote:

> Depends on what you want the parse tree to capture. The classic example in English
> that is analagous to "copy language" is
>
>   John, Mary, Fred, and Alice received a book, a medal, a pen, and a ribbon,
> respectively.
>
> A CF grammar can parse this but the output won't make the right cross-connections.

That is a very good example. Thank you.

My concern and probably the source of my confusion that what I think is being talked
about in this case (that is the wrong cross-connections are being made) is a matter
of meaning. And that what lexical and syntactic parsing does not do, is to assign
meaning.

Best Regards,

:) will

From: Chris Riesbeck
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <riesbeck-2710991223410001@riesbeck.ils.nwu.edu>

In article <·················@pindar.com>, ········@pindar.com wrote:

>Chris Riesbeck wrote:
>>
>>   John, Mary, Fred, and Alice received a book, a medal, a pen, and a ribbon,
>> respectively.
>>
>> A CF grammar can parse this but the output won't make the right
cross-connections.
>
>My concern and probably the source of my confusion that what I think is
being talked
>about in this case (that is the wrong cross-connections are being made)
is a matter
>of meaning. And that what lexical and syntactic parsing does not do, is
to assign
>meaning.

If you believe that semantic interpretation is based on
the syntactic parse tree (which most people do, though I don't), 
then you have to get the cross-connections in the parse tree 
right for the semantic interpreter to be able to do its job.

From: Lieven Marchand
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <m3aep6yrah.fsf@localhost.localdomain>

William Deakin <·····@pindar.com> writes:

> Larry Hunter wrote:
> 
> > It's actually amazing how many fairly simple concepts require
> > context-sensitive...grammars. Probably the best known example is the "copy
> > language", that is, any string followed by a copy of itself.
> 
> Hmmm. This is interesting but is the copy language actually a language? In my
> simple way, I would say this more looks like a way of encoding data.
> 

Sure it is. It's even very common in computer languages. The following
Modula or Ada style function is an example:

FUNCTION Foo () =
BEGIN
<Body>
END Foo.

As Pierre said, the usual way to handle this in parser generators for
context free languages is to change the grammar to:

FUNCTION Identifier <ArgumentList> =
BEGIN
<Body>
END Identifier.

and checking the equality of the Identifiers in a later stage. So
theoretically, the parser accepts a larger set of strings and a later
stage removes the inappropriate ones.

-- 
Lieven Marchand <···@bewoner.dma.be>
If there are aliens, they play Go. -- Lasker

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <3816B48A.F052565B@pindar.com>

I wasn't arguing with Pierre (I hope he doesn't think thats what I was).

I suppose what I was saying in my own vague way was that I hadn't come across
context dependant grammars was because the ones that I had were transformed into
context free ones.

I was being smart, and what everbody likes on this newsgroup is to kick somebody
who is being smart.

Best Regards,

:) will

From: Pierre R. Mai
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <87g0ywooel.fsf@orion.dent.isdn.cs.tu-berlin.de>

William Deakin <·····@pindar.com> writes:

> I wasn't arguing with Pierre (I hope he doesn't think thats what I
> was).

Rest assured, I don't... ;)

I also wasn't really trying to make a point either, I just wanted to
offer another perspective to grammars, because context-sensitive (yep, 
I often get my terms muddled up somewhat when not thinking clearly,
which happens more often than I'd like... ;) grammars are often
forgotten about...

Regs, Pierre.

-- 
Pierre Mai <····@acm.org>         PGP and GPG keys at your nearest Keyserver
  "One smaller motivation which, in part, stems from altruism is Microsoft-
   bashing." [Microsoft memo, see http://www.opensource.org/halloween1.html]

From: Tim Bradshaw
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <ey3ln8qf02b.fsf@lostwithiel.tfeb.org>

* William Deakin wrote:

> Hmmm. This is interesting but is the copy language actually a language? In my
> simple way, I would say this more looks like a way of encoding data.

It's a language in the sense that it specifies a set of strings which
belong to it, which (specified much more precisely) is the usual way
of defining a language in these contexts.

--tim

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Wed, 27 Oct 1999 00:00:00 +0000
Message-ID: <3816B173.225008A8@pindar.com>

Tim Bradshaw wrote:

> ...which is the usual way of defining a language in these contexts.

Ah ha, a context sensitive meaning ;)

Best Regards,

:) will

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <381467DB.BD255CBE@pindar.com>

Marco Antoniotti wrote:

> > 3) I currently think that parsers (yacc, etc.) should be regexp-based.
>
> Forgive my bluntness, but I believe a re-reading of the white book or
> of the Dragon book is in order :)

Could you tell me what the white book and the Dragon book are?

> As a matter of fact, what is really needed (and something which has
> appeared in some of the most modern 'parsing languages' in the Java world
> and also in Zebu) is to have a single language (grammars) to specify
> Regular and (Extended) Context Free recognizers and generators.

Yes. I agree. If I was to say Chomsky, where would you see this sitting?

Best Regards,

:) will

From: William Deakin
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <38147C3C.731D20B3@pindar.com>

William Deakin wrote:

This is what I manages to sort on on the weekend and is the first part of the
requirements, discussing regular-expressions. I also have headings for the next two
sections of this document:

4.2 Matching, Binding-Values and Destructuring
5. Matching Syntax
5.1 Character Operations
5.2 Quantifying Matching
5.3 Character Range Operations
5.4 Anchor and Conditional Transitions
5.5 Case of Strings and Characters
5.6 Matching-Expression Operations
5.7 Binding and Grouping Syntax

If there is something glaringly obvious missing from this, please let me know. I have
notes on these section but need a little more time to turn this into something a little
more presentable. Any comments would be appreciated.

Best Regards,

:) will

4. Requirements:

An effective system for pattern matching, substitution and value binding for Common LISP
will (at least) allow `regular-expression' character and string matching; extend
matching to a larger set of data-types; define markers such as end-of-line; match and
substitute values in data; and have a mechanism to apply LISP functions and macros to
lexically scoped grouped or bound values.

In addition to this, the system should be sufficient flexible to allow user-definable
markers; allow easy manipulation of grouped or value-binding; control the way in which
the matches are carried out; and enable back-tracking, either explicitly or to ensure
back-tracking is possible.

4.1 Regular-Expressions.

First, what are `regular-expressions?' and how will they to be different from
perl/awk/emacs `regular-expressions?' The following discussion is based on my views
about this, and should not be read as fact:

The use of `regular-expressions' consists of: matching patterns within data;
substituting values for matched patterns; and extracting groupings (or binding-values)
from data.

This is usually implemented as follows: A regular-expression or grammar is passed to an
interpreter or compilation engine; the engine applies the grammar to data; and the
engine then returns a value indicating a sucessful match or `parse'. If successful, a
copy of the data with substitutions applied, and any resulting bound-values are
returned.

However, as I understand this, most implementations extend what is meant by a
regular-expression. For example grouping and reverse searching in perl are not
regular-expression operations. This extended syntax can be described using a Chomsky
`regular' or level 3 language grammar. The regular-expression is a `little-language'
definition and the engine then interprets or compiles the regular express, and applies
this grammar to data.

In line with other regular-expression implementations, the proposed matching system will
extend true regular-expression matching. In this case, what extensions should be
included? It is my feeling that these kind of decisions should pragmatic, and as such
should be around such issues as efficiency and ease of implementation. But, if this must
be balanced with doing `the-right-thing' rather than having `worse-is-better' as the
design methodology.

From: ········@my-deja.com
Subject: Re: Update to A Proposed Specification for Lisp REGEXP
Date: Tue, 02 Nov 1999 00:00:00 +0000
Message-ID: <7vn3p5$4ou$1@nnrp1.deja.com>

Work on the specification is progressing, but slowly. The difficulties
are to do with syntax, what the pattern-matching system is to do and
what the syntax is to be. This posting is a kind of "work in progress."

I have the following examples of a way to implement a matching-system:

> (match (:bind  a (:bind b :bind (word) c) :bind (&rest) d)
        (9.6 "potato" 32 1 84)
        (list a b c d))

> (9.6 #\p "otato" (32 1 84))

> (match (p (:bind x))
        (p a)
        (values (and :match-p (values `((x . ,',x))))
                 :match-p))

> nil

> (match (p (:bind x) b (:bind y) c)
         (p (:bind y)  b d c)
         (values (and :match-p values `((x . ,',x) (y . ,',y)))
                 :match-p))
> ((x . c) (y . x))
  t

Or equivalent, ((x . c) (y . c)) &c.

Applying bindings in order:

(defmacro identifier (data)
  `(match :reduce (:bind (and alpha (* alnum)) this))
       ,',@data
       `(if (:match-p)
         (values ((type :ident) (name ,',this)) :match-p)
         nil))

(defmacro assignment (data)
  `(match :reduce (:bind (and (? (or #\+ #\- #\* #\/)) #\=)) this)
       ,',@data
       `(if (:match-p)
          (values ((type :assign) (name ,this)))
          nil))

(defmacro value (data)
  `(match :reduce ((:bind (? (or #\+ #\-) sign))
                          (:bind (+ digit) value))
                         ,',@data
                         `(if (:match-p)
                            (values ((type :value) (sign ,',this) (value
,',value)))
                            nil)))

A token macro could look like:

(defun token (data)
   (or (value data) (assignment data) (value data))

> (match :reduce "a24+=7" (:bind #'token this) (values this :data))
> ((type :ident) (name "a42))
   "+=-7"

> (while (match :reduce "a24+=-7" (:bind #'token this) (values this)))
> ((type :ident) (name "a42"))
  ((type :assign) (name "+="))
  ((type :value) (sign "-") (value "7"))

This syntax is pretty ugly, but consists of:

:bind to indicate a value to be bound, this then consists of a
matching-expression and name for the value to be bound to.

:reduce indicates all data upto the match should be removed.

:match-p indicates whether a match has been sucessful.

* and ? are the usual quantifiers. A suggested syntax is:

{n,m}    between n and m;
{n,}     at least n;
{,m}     no more that m occurrences of the character(s);
*        equivalent to {0,};
+        equivalent to {1,};
?        {0,1}

The examples cover regular expression and destructuringing syntax.

I would be glad of any suggestions or help. In particular: does this do
what you would want of a matching-package? Is there a more elegant
method of doing this?

Thank you to everybody who has given input to this project so far. In
particular Paul Foley for his original sugestions that have lead to
these examples. However, all mistakes and errors are my own.

Best Regards,

:) will





Sent via Deja.com http://www.deja.com/
Before you buy.

From: Donald Fisk
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Mon, 25 Oct 1999 00:00:00 +0000
Message-ID: <3814BAFC.6F77DB7B@inthan.be>

Stig Hemmer wrote:
> 
> William Deakin <·····@pindar.com> writes:
> > What I would like to extend, however, is the way in which variables
> > are bound. There is a bit in Olin's excellent specification where he
> > talks about the difficulties of binding within regular
> > expressions. I agree, but only to a degree. I think this is an area
> > where Lisp Can Win Big.
> 
> You might want to look at Paul Grahams book "On Lisp" where he
> addresses this in the context of embedding Prolog in Lisp.

You might also want to look at the language Snobol 4, which does
binding within what are very similar to regular expressions.
See www.snobol4.com for more details.   I'm not suggesting the
Snobol syntax be followed -- as long as the regular expressions
are more readable than the ones in the Pathologically Eclectic
Rubbish Lister I don't mind.   It's just that Ralph Griswold
solved this particular problem for Snobol a long time ago.

> Stig Hemmer,

Le Hibou (ma propre opinion)

-- 
"                      |
            Ceci n'est pas une pipe.
"
                            -- Daniel B. Case.

From: Marco Antoniotti
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <lwiu3u621r.fsf@copernico.parades.rm.cnr.it>

Donald Fisk <···········@inthan.be> writes:

> You might also want to look at the language Snobol 4, which does
> binding within what are very similar to regular expressions.
> See www.snobol4.com for more details.   I'm not suggesting the
> Snobol syntax be followed -- as long as the regular expressions
> are more readable than the ones in the Pathologically Eclectic
> Rubbish Lister I don't mind.   It's just that Ralph Griswold
> solved this particular problem for Snobol a long time ago.

Wasn't Griswold also involved in the development of the Icon language?
http://www.cs.arizona.edu/icon

I am not that familiar with it, but AFAIR, it may be a more modern
starting point.

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Tue, 26 Oct 1999 00:00:00 +0000
Message-ID: <381566BF.BF212D12@pindar.com>

Marco Antoniotti wrote:

> Donald Fisk writes:
> > You might also want to look at the language Snobol 4 ....
>
> Wasn't Griswold also involved in the development of the Icon language?
> http://www.cs.arizona.edu/icon

Thank you both for the references, I will have a look some time soon.

Best Regards,

:) will

From: Eric Scott
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <3804F558.39E3@schemas.sdsu.edu>

William Deakin Writes:

> 5. Unresolved Issues:
> 
> 5.1 Regular-Expression Macro-Character:
> 
> The regular-expression macro-character has to be decided upon. It is
> suggested that #! is used.
> 

I believe that is already used by the PLOb! (Persistent Lisp Objects)
package...

- Eric Scott

From: Rob Warnock
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <7u35q0$31b93@fido.engr.sgi.com>

Eric Scott  <········@schemas.sdsu.edu> wrote:
+---------------
| William Deakin Writes:
| > The regular-expression macro-character has to be decided upon. It is
| > suggested that #! is used.
| 
| I believe that is already used by the PLOb! (Persistent Lisp Objects)
| package...
+---------------

Not to mention Unix/Linux shell scripts that use the "#!/path/to/interp"
syntax...


-Rob

-----
Rob Warnock, 8L-846		····@sgi.com
Applied Networking		http://reality.sgi.com/rpw3/
Silicon Graphics, Inc.		Phone: 650-933-1673
1600 Amphitheatre Pkwy.		FAX: 650-933-0511
Mountain View, CA  94043	PP-ASEL-IA

From: Ray Blaak
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Wed, 13 Oct 1999 00:00:00 +0000
Message-ID: <m33dve37hs.fsf@vault82.infomatch.bc.ca>

William Deakin <·····@pindar.com> writes:
> I have sat down and looked at what I would like to see in a REGEXP
> package for cl. This is an early draft document and comments are
> welcome:

I think you already know about the s-expression notation for regexps (SRE),
described at: http://www.ai.mit.edu/~shivers/sre.txt

Why not just go with it? It is well designed, addresses all of the issues you
mentioned, and is implementable in any lisp system (the current implementation
is in Scheme, but the notation is actually independent of Scheme).

I think it more-or-less "out lisps" perl regexps.

-- 
Cheers,                                        The Rhythm is around me,
                                               The Rhythm has control.
Ray Blaak                                      The Rhythm is inside me,
·····@infomatch.com                            The Rhythm has my soul.

From: William Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <38059099.70BC3300@pindar.com>

Ray Blaak wrote:

> Why not just go with it?

Because I am bone idle and stupid ;) No, sorry, because I was away on holiday and
wanted to do something when I got back. But to do this without reference to any
other system and to DO IT NOW. I thought time was slipping by so in a quiet
half-hour I banged out that specification.

OK. Now for some reading of SRE. I will update the specification accordingly
sometime soon.

Thank you for you comments,

Best Regards,

:) will

From: Christopher R. Barry
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <87iu49aob9.fsf@2xtreme.net>

William Deakin <·····@pindar.com> writes:

> I was away on holiday and wanted to do something when I got back.
> But to do this without reference to any other system and to DO IT
> NOW. I thought time was slipping by so in a quiet half-hour I banged
> out that specification.

Will,

Why do you think this is a good task for you? Yesterday when I fired
up gnus and saw "A Proposed Specification for Lisp REGEXP", I got a
little excited, because I was expecting, well, an actual
specification. Something akin to the specification for Extended LOOP
in the HyperSpec, for example.

Making a specification for this kind of thing is _hard work_. The Lisp
community, in particular, is a hard customer to sell new language
features and ideas.

If you want to design and implement Yet Another REGEXP Package,
perhaps with some design feedback and programming help from others,
then start a mailing-list just like every other community software
project on the planet has and have at it.

If you want to design a specification for the missing One True
Future-ANSI-Common-Lisp Regular Expression Facility, you're looking at
at least several hundred hours of work before you have something
presentable.

Most of your posts don't have a lot of Lisp content. I tend to skim
over most of them. In the ones that do have Lisp content, you're
usually asking questions about beginning things (like "Potatos and
CLOS" [IIRC]). I'd recommend you learn more about what Lisp already
has and how to use it, then revisit this if you're ready to be patient
and you're not going to try and bang something out in under an hour
and then boldly announce it as "A Proposed Specification for Lisp
REGEXP" to the community.

Christopher

From: Quentin Deakin
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Thu, 14 Oct 1999 00:00:00 +0000
Message-ID: <7u58cv$e0r$1@epos.tesco.net>

Christopher R. Barry wrote:
>Why do you think this is a good task for you?

It is very likely not. No. But, If not now, when? if not me, who?

> Yesterday when I fired up gnus and saw "A Proposed Specification for Lisp
REGEXP", I got a little excited, because I was expecting, well, an actual
specification.

Good. I'm glad you got a little excited. I'm a little excited too. This is
the kind of thing that people should get excited about.

> Something akin to the specification for Extended LOOP in the HyperSpec,
for example.

Yes, my proposed specification is tosh. And having looked at what else is
out there I probably have shot myself in the foot by publishing a very rough
draft as to what I would like to see in a REGEXP package. But I must say
that I was trying to gauge what kind of interest there would be in such a
thing.

Also, this is the way I work. I am given a project from a (the) customer.
Then I write a very rough specification. Put it out for discussion. Let
people discuss this and on the basis of this very rough specification, look
at what is wanted. Update the specification, putting a large amount of work
to get it up to scratch. Then get to code the thing.

>...start a mailing-list just like every other community software project on
the planet has and have at it.

OK. Yes. That is what I was thinking of doing. But I would like this to be
more than just Yet Another REGEXP Package. I would like to generate some
kind of document that tries to set out what should be in a REGEXP package.
And something that I don't own but is the collected wisdon of many people.

>If you want to design a specification for the missing One True
Future-ANSI-Common-Lisp Regular Expression Facility, ...

No, I not sure that is what I am trying to do. I would like to do something
that solves some problems that I have got and that seem to cause a lot of
people some grief. Looking at what other people say, at least.

> ...you're looking at at least several hundred hours of work before you
have  something presentable.

That is true. But a journey of a thousand miles starts with but one step.

> I'd recommend you learn more about what Lisp already has and how to use
it, then > revisit this if you're ready to be patient

I am sorry that you don't like my posts and think that I ask beginners
questions. I do not pretend to be something that I am not. However, there is
a purpose to asking these question. I have been thinking about regexp for
some weeks now and wanted to see what reaction I would get. Fools rush in
and all that.

>and you're not going to try and bang something out in under an hour
>and then boldly announce it as "A Proposed Specification for Lisp
>REGEXP" to the community.

I am truly sorry that I have disappointed you so much. This was a
prentensious  title for a rather grandiose idea. On the face of it what I
posted was ill conceived and I should have spent more time on the draft.
But, I am an enthusiast and get carried away and this is USENET not Physical
Review Letters.

Also, I feel sad when I read your post. It turns the colours a little greyer
and the light a little dimmer. Where is your enthusiasm and joy in life? Do
you think a regexp specification is worthwhile? If you do, has anybody tried
something like this? I did a search on dejanews about regexp and lisp and it
is something that has run and run.

Finally, I though I could get something going which would help me to try and
gain some acceptance of lisp. This is my selfish motivation, to try and stop
cl from being written out of my place of work.

Manager: 'We must get rid of Lisp and replace it with C++ and perl'
Me: 'why?'
Manager: 'Well lots of people know C++ so no worries about a shortage of
employees. And perl because it has a really good regex for text file
manipulation.'

Lisp currently has neither of these, or why else is this said ;).  So: I
would like to see Lisp out-lisp perl (and C++ for that matter) and to try
and get more people using lisp.

Best Regards,

:) will

From: Tim Bradshaw
Subject: Re: A Proposed Specification for Lisp REGEXP
Date: Mon, 18 Oct 1999 00:00:00 +0000
Message-ID: <ey3g0z8edcr.fsf@lostwithiel.tfeb.org>

* Christopher R Barry wrote:

> If you want to design and implement Yet Another REGEXP Package,
> perhaps with some design feedback and programming help from others,
> then start a mailing-list just like every other community software
> project on the planet has and have at it.

I'd much rather he posted at least announcements to CLL, I really
don't need yet another special-purpose mailing list.

--tim