From: Mitchell Morris
Subject: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <slrn7v6v5l.c10.mgm@unpkhswm04.bscc.bls.com>
I've now read from several sources that parsing isn't in the spirit of Lisp.
I must admit this confuses me, in that I also read Lisp advocacy that it
shouldn't be necessary to use other inferior languages (like C++) to
accomplish any significant task.

I can reconcile these two concepts when I'm trying to read data that I
generate myself, but I can't seem to reconcile them when I'm trying to read
the data supplied by my client(s). It's perhaps "purer" (for some definition
of "pure") to claim that the input data should already be in a Lisp-friendly
format, but it doesn't seem practical to push back on my customers like that
in every (or even many) cases.

So tell me ... what's a boy to do? Write some not-Lisp-y Lisp to parse the
data? Write a C++ preprocessor and insert it into the workflow ahead of the
Lisp process? Quit programming and take up burglary?

+Mitchell

P.S. Of course, it's always possible that I'm manufacturing work for myself
by trying to be philosophically pure. If so, please feel free to point that
out too.



-- 
Mitchell Morris

You people all use ADSL to be cool, right? I'm still using ATM, because I
don't like TrueType.
	-- Kibo

From: Marc Cavazza
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F38DD6.98E5EE6@bradford.ac.uk>
Mitchell Morris wrote:

> I've now read from several sources that parsing isn't in the spirit of Lisp.

I'm not sure I understand this (probably because I develop NL parsers
in Lisp :-)

Now, getting back to parsing other things than NL, there has been a case
made that Lisp was "poor" at string processing (generally by those advocating
Lisp/C integration), but that's all I recall

> So tell me ... what's a boy to do?

If all you have to parse are data formats, why not write your
own parser -- unless you can use some Lisp implementation
of the "usual suspects" (check the CMU repository)?

> +Mitchell
>
> P.S. Of course, it's always possible that I'm manufacturing work for myself
> by trying to be philosophically pure. If so, please feel free to point that
> out too.

Thanks for opening that huge can of worms. Let us make it the next thread.

Marc
From: Christopher R. Barry
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87emfgp1j1.fsf@2xtreme.net>
Marc Cavazza <·········@bradford.ac.uk> writes:

> Now, getting back to parsing other things than NL, there has been a case
> made that Lisp was "poor" at string processing (generally by those advocating
> Lisp/C integration), but that's all I recall

Well, ANSI Common Lisp is poor for general string processing due to
its lack of any regular expression facility. It has very powerful
formatted output capabilities, however.
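
(By "powerful" I mean things like, off the top of my head,

  (format nil "~:D" 1234567)   => "1,234,567"
  (format nil "~R" 42)         => "forty-two"
  (format nil "~{~A~#[~;, and ~:;, ~]~}" '("lions" "tigers" "bears"))
                               => "lions, tigers, and bears"

all of it standard FORMAT directives, no extra library required.)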

You can of course make small Perl or Emacs tools to remedy the regular
expression deficiency. (Allegro CL also has a regex package, but IMHO
there is room for a bit of improvement and I'd use Perl or egrep
instead.)

Christopher
From: Klaus Schilling
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87puyzejkf.fsf@schilling.cys.de>
······@2xtreme.net (Christopher R. Barry) writes:

> Marc Cavazza <·········@bradford.ac.uk> writes:
> 
> > Now, getting back to parsing other things than NL, there has been a case
> > made that Lisp was "poor" at string processing (generally by those advocating
> > Lisp/C integration), but that's all I recall
> 
> Well, ANSI Common Lisp is poor for general string processing due to
> its lack of any regular expression facility. It has very powerful
> formatted output capabilities, however.
> 

Is it possible (even when depending on the Implementation) to do
ffi calls to some regex library?

Klaus Schilling
From: Dorai Sitaram
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <7t2tr9$avf$1@news.gte.com>
In article <··············@schilling.cys.de>,
Klaus Schilling  <·····@schilling.cys.de> wrote:
>······@2xtreme.net (Christopher R. Barry) writes:
>> 
>> Well, ANSI Common Lisp is poor for general string processing due to
>> its lack of any regular expression facility. It has very powerful
>> formatted output capabilities, however.
>
>Is it possible (even when depending on the Implementation) to do
>ffi calls to some regex library?

Another naive question: All talk about regexps in CL
seems to center on how to access a foreign library and
the difficulties that that causes (Henry Spencer's
style, CBIND problems, etc.).  Is this because it would
be prohibitively inefficient to write a regexp library
entirely within CL?  Has this even been attempted?

--d
From: Gareth McCaughan
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <86hfkazky1.fsf@g.local>
Dorai Sitaram wrote:

> Another naive question: All talk about regexps in CL
> seems to center on how to access a foreign library and
> the difficulties that that causes (Henry Spencer's
> style, CBIND problems, etc.).  Is this because it would
> be prohibitively inefficient to write a regexp library
> entirely within CL?  Has this even been attempted?

I think it's because (1) there are already some good
regexp libraries written in C, (2) most Lisp systems
have some way of talking to libraries written in C,
and (3) writing a really good regexp library is a
difficult and time-consuming job.

-- 
Gareth McCaughan  ················@pobox.com
sig under construction
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey37ll5vr77.fsf@lostwithiel.tfeb.org>
* Gareth McCaughan wrote:

> I think it's because (1) there are already some good
> regexp libraries written in C, (2) most Lisp systems
> have some way of talking to libraries written in C,
> and (3) writing a really good regexp library is a
> difficult and time-consuming job.

However, it's a significant win if you can do it.  I would *much*
rather talk about some nice structure-based regexp system than the
horrible string-based stuff one actually gets.  Think how cool regular
expressions would be if you could talk about them as list structures,
say.  Lisp could really win here.

I think that some of the scsh people -- perhaps Olin Shivers -- have
done some work along these lines.  I wonder if there's an
implementation which could be CL-ified?

--tim
From: Gareth McCaughan
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <86u2o8rsse.fsf@g.local>
Tim Bradshaw wrote:

[regexps]
> However, it's a significant win if you can do it.  I would *much*
> rather talk about some nice structure-based regexp system than the
> horrible string-based stuff one actually gets.  Think how cool regular
> expressions would be if you could talk about them as list structures,
> say.  Lisp could really win here.

Do you mean (1) describing regexps with list structure, or (2)
applying regexps to list structure? I'm guessing the former.

I think that would be good to have, but I'd want there to be
a string form too, for (1) compactness and (2) standardness.
For big hairy REs and sophisticated applications of them, a
more sophisticated representation would be great, but I'd much
rather write "foo.*bar" than (catenate "foo" (repeat any-char) "bar")
most of the time.

Regexp-alikes that work on arbitrary sequences, or some modest
generalisation working on list structure, might be rather neat
too.

-- 
Gareth McCaughan  ················@pobox.com
sig under construction
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3u2o7xv07.fsf@lostwithiel.tfeb.org>
* Gareth McCaughan wrote:

> Do you mean (1) describing regexps with list structure, or (2)
> applying regexps to list structure? I'm guessing the former.

the former

> I think that would be good to have, but I'd want there to be
> a string form too, for (1) compactness and (2) standardness.
> For big hairy REs and sophisticated applications of them, a
> more sophisticated representation would be great, but I'd much
> rather write "foo.*bar" than (catenate "foo" (repeat any-char) "bar")
> most of the time.

I think I'd be happy to write (inventing this, of course) 
(and "foo" (* *) "bar"), I guess a string form would be useful, but
then you'd have to pick one of the myriad variants.

Of course, if you have a foreign-callable C regexp package, it's very
easy to compile down the structured version into strings that it will
eat.
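
A naive sketch of that compiler (the operator names are invented here,
and it cheerfully ignores character-set and escaping subtleties):

(defun sexp-regexp->string (re)
  ;; Translate an invented s-expression regexp form into a POSIX-ish
  ;; egrep string: AND is sequencing, OR is alternation, * is
  ;; repetition and ANY is the any-character wildcard.
  (etypecase re
    (string (with-output-to-string (s)   ; quote literal metacharacters
              (loop for ch across re
                    do (when (find ch ".*+?[]()|^$\\")
                         (write-char #\\ s))
                       (write-char ch s))))
    (symbol (ecase re (any ".")))
    (cons (ecase (first re)
            (and (apply #'concatenate 'string
                        (mapcar #'sexp-regexp->string (rest re))))
            (or  (format nil "(~{~A~^|~})"
                         (mapcar #'sexp-regexp->string (rest re))))
            (*   (format nil "(~A)*"
                         (sexp-regexp->string (second re))))))))

so that (sexp-regexp->string '(and "foo" (* any) "bar")) comes out as
"foo(.)*bar".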

--tim
From: Hannu Koivisto
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87wvt3ckmi.fsf@senstation.vvf.fi>
Tim Bradshaw <···@tfeb.org> writes:

| I think I'd be happy to write (inventing this, of course) 
| (and "foo" (* *) "bar"), 

This would be (seq "foo" (* any) "bar") in SRE syntax.

| I guess a string form would be useful, but then you'd have to pick
| one of the myriad variants.

Yup.  SRE has picked POSIX.  You can even mix POSIX regexps inside
SRE regexps.  I guess you could add other major variants quite
easily once the first one is in place.
 
| Of course, if you have a foreign-callable C regexp package, it's very
| easy to compile down the structured version into strings that it will
| eat.

It may appear so, but as Shivers comments in the SRE documentation,
character-set issues, for example, make it more complicated than one
might expect.  I haven't really looked into that part, as I don't plan
to use a C regexp package, at first at least.

-- 
Hannu
From: William Deakin
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F880F3.7B840F17@pindar.com>
Tim Bradshaw wrote:

> I would *much* rather talk about some nice structure-based regexp system
> than the horrible string-based stuff one actually gets.  Think how cool
> regular expressions would be if you could talk about them as list
> structures, say.  Lisp could really win here.

Yes, so would I. This sounds most excellent.

In my usual hazy way I would like to ask: How far would this regexp
library be from a pattern matching library that binds values? I would be
interested in something like this or doing something about this.  A fast
regexp system that does the string matching that Gareth McCaughan
describes would be nice up to a point, although it does smack of trying to
get CL to out-perl Perl. No bad thing :)

But a more general regexp system along the lines of pat-match[1]
or match[2][3] would be a thing of beauty to behold. And it would out-lisp
Perl ;)
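
For concreteness, a stripped-down matcher in that spirit -- variables
bind, everything else must be equal -- might look like this (a toy,
nothing like the full pat-match of [1]):

(defun toy-match (pattern input &optional (bindings '()))
  ;; Variables are symbols whose names start with ?, e.g. ?X.  Returns
  ;; an alist of bindings on success, or :FAIL.
  (cond ((eq bindings :fail) :fail)
        ((and (symbolp pattern)
              (plusp (length (symbol-name pattern)))
              (char= (char (symbol-name pattern) 0) #\?))
         (let ((old (assoc pattern bindings)))
           (cond ((null old) (acons pattern input bindings))
                 ((equal (cdr old) input) bindings)
                 (t :fail))))
        ((and (consp pattern) (consp input))
         (toy-match (cdr pattern) (cdr input)
                    (toy-match (car pattern) (car input) bindings)))
        ((equal pattern input) bindings)
        (t :fail)))

so (toy-match '(?x likes ?y) '(lisp likes parens)) gives
((?Y . PARENS) (?X . LISP)).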

Best Regards,

:) will

[1] 'Paradigms of AI Programming: Case Studies in Common Lisp' by Peter
Norvig pp. 155-6, 158, 160, 181 and 332.
[2] 'ANSI Common Lisp' by Paul Graham. I haven't got the book to hand so I
can't give a page number.
[3] 'On Lisp' by Paul Graham, p. 239.
From: Olin Shivers
Subject: An s-expression notation for regexps
Date: 
Message-ID: <qijg0zr6x7x.fsf_-_@lambda.ai.mit.edu>
I've recently gotten several inquiries about my SRE notation for
writing down regular expressions in an s-expression framework, due
to some discussion about regexps on this list. A couple of guys have
asked about licensing & copyright issues for my package. Here's the
deal:

There is a writeup describing the notation & its design (in simple text) at
    http://www.ai.mit.edu/~shivers/sre.txt
Just click 'n view.

If you'd like to get the Scheme source code, *don't* use the SRE source on
my home page; it is old and buggy.  (I think I will remove it from the site
now.) Instead, wait a day or two & watch comp.lang.scheme.scsh. We are
releasing scsh 0.5.2 as soon as we can copy the tar ball over to the server
(having a few server problems today at MIT). That release has a fully
functional SRE implementation, and the manual has 25 pages of documentation on
the system, in LaTeX.

*All* the code in the new release, including the SRE system, is open
source. Help yourself. Just remember to cut me in on the IPO.

Common Lisp hackers should note that the SRE notation really has *nothing* to
do with Scheme. Or Lisp, for that matter. It's just an s-expression notation
for writing down regular expressions. I made an effort when I designed it to
have the spec accommodate the particulars of Common Lisp's sexp grammar as well
as Scheme's sexp grammar.

William Deakin  (········@pindar.com) and Hannu Koivisto (·····@iki.fi)
have expressed interest in a Common Lisp port. Other people who are
interested ought to hook up with these guys.

Manuel Serrano & I expect to make the SRE spec a SRFI in the near future.
(SRFI's are a Scheme standardisation process.)  I'll try to remember to
announce it here on comp.lang.lisp when the discussion period starts.

Finally, a word of warning about regular expressions: don't overuse them.
People, especially people in the scripting/perl/text-hacking community
love to rave about how powerful regexps are. They aren't. Their efficiency
comes *directly* from limiting what they can do -- limiting them to matching
that can be performed by a finite-state automaton.

Hackers are *always* misusing regexps to write heuristic parsers to parse
notations (like html) that are *not* "regular languages." This is the sort
of error-prone mostly-works hack that makes Unix programs so flaky.

If you want to parse something that is non-regular -- for example, something
that has recursive, nested structure, like HTML -- don't use regexps. Use a
more powerful parser generator. If one doesn't exist in Common Lisp, then
someone should go implement one. A beautiful example of how a good design can
make this super-easy and simple for programmers is the little macro that
Manuel Serrano built for Bigloo, a Scheme implementation, allowing programmers
to describe grammars and read in strings with them. I highly recommend that
interested hackers check out the Bigloo manual if they have any interest in
this kind of thing.
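
For a concrete (if toy) illustration of the difference -- this has nothing
to do with Bigloo's macro, it's just a hand-rolled recursive-descent reader
for balanced parentheses, which no regular expression can recognize:

(defun parse-nested (string &optional (start 0))
  ;; Read one balanced (...) group starting at START; return its
  ;; contents as a nested list, plus the index just past the closing
  ;; paren.  No error handling for unbalanced input -- it's a toy.
  (assert (char= (char string start) #\())
  (let ((items '())
        (i (1+ start)))
    (loop
      (case (char string i)
        (#\( (multiple-value-bind (sub next) (parse-nested string i)
               (push sub items)
               (setf i next)))
        (#\) (return (values (nreverse items) (1+ i))))
        (t   (push (char string i) items)
             (incf i))))))

A dozen lines of real parser handle what no amount of regexp heuristics
can do reliably.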

Use the right tools.
    -Olin
From: Marco Antoniotti
Subject: Re: An s-expression notation for regexps
Date: 
Message-ID: <lw4sg78a3l.fsf@copernico.parades.rm.cnr.it>
Olin Shivers <·······@lambda.ai.mit.edu> writes:

> Common Lisp hackers should note that the SRE notation really has *nothing* to
> do with Scheme. Or Lisp, for that matter. It's just an s-expression notation
> for writing down regular expressions. I made an effort when I designed it to
> have the spec accommodate the particulars of Common Lisp's sexp
> grammar as well as Scheme's sexp grammar.

I just had a second look at the SRE doc.  Thanks for having included
CL friendlier operators in SRE. #\| and #\: don't quite cut it
otherwise. :)

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa
From: Marco Antoniotti
Subject: Re: An s-expression notation for regexps
Date: 
Message-ID: <lw3dvr89ld.fsf@copernico.parades.rm.cnr.it>
Marco Antoniotti <·······@copernico.parades.rm.cnr.it> writes:

> I just had a second look at the SRE doc.  Thanks for having included
> CL friendlier operators in SRE. #\| and #\: don't quite cut it
> otherwise. :)

That is to say: I stand corrected!

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa
From: Dave Bakhash
Subject: Re: An s-expression notation for regexps
Date: 
Message-ID: <m3u2o4d2p3.fsf@lost-in-space.ne.mediaone.net>
where is this SRE spec located?

dave
From: Christopher R. Barry
Subject: Re: An s-expression notation for regexps
Date: 
Message-ID: <87vh8kxw4z.fsf@2xtreme.net>
Dave Bakhash <·····@bu.edu> writes:

> where is this SRE spec located?

Olin included it near the beginning of his article:
http://www.ai.mit.edu/~shivers/sre.txt

Christopher
From: Lars Marius Garshol
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <m3so3richf.fsf@ifi.uio.no>
* Tim Bradshaw
| 
| I would *much* rather talk about some nice structure-based regexp
| system than the horrible string-based stuff one actually gets.
| Think how cool regular expressions would be if you could talk about
| them as list structures, say.  Lisp could really win here.

* William Deakin
| 
| In my usual hazy way I would like to ask: How far would this regexp
| library be from a pattern matching library that binds values? I
| would be interested in something like this or doing something about
| this.

Then you should look at Olin Shivers's SRE. It's for Scheme, but the
design looks very good, and as always with Shivers the design paper is
a jewel of technical writing. Not looking at it would be a major
mistake.

--Lars M.
From: Marco Antoniotti
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <lw670n8adu.fsf@copernico.parades.rm.cnr.it>
Lars Marius Garshol <······@garshol.priv.no> writes:

> * Tim Bradshaw
> | 
> | I would *much* rather talk about some nice structure-based regexp
> | system than the horrible string-based stuff one actually gets.
> | Think how cool regular expressions would be if you could talk about
> | them as list structures, say.  Lisp could really win here.
> 
> * William Deakin
> | 
> | In my usual hazy way I would like to ask: How far would this regexp
> | library be from a pattern matching library that binds values? I
> | would be interested in something like this or doing something about
> | this.
> 
> Then you should look at Olin Shivers's SRE. It's for Scheme, but the
> design looks very good, and as always with Shivers the design paper is
> a jewel of technical writing. Not looking at it would be a major
> mistake.

I agree.  Shivers' paper on SRE is very good.  Unfortunately he made
the mistake (all right, this is way too nasty to say :) ) of making two
operators practically impossible to use in Common Lisp.  E.g. the
disjunction operator is #\|.  Not good. :{

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa
From: William Deakin
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F89AF8.BD276D4C@pindar.com>
Lars Marius Garshol wrote:

> Then you should look at Olin Shivers's SRE. It's for Scheme, but the
> design looks very good, and as always with Shivers the design paper is a
> jewel of technical writing. Not looking at it would be a major mistake.

Ta, la. I have downloaded sre (from the link given in this thread) and am
in the process of digging my way through it.

However, it is dated July 1998, and from these postings I get the feeling
it is not the most up-to-date version. Is this correct?

Then, if I were to try to port it to CL, where would be a sensible place to
put it for evaluation?
(I'm not sure if this is legal so I would want to check this out and get
permission, so please don't flame me. I will have a long look at the code
first and see what I can do.)

Best Regards,

:) will
From: Christopher Browne
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <QzeJ3.30493$d71.878104@news4.giganews.com>
On 1 Oct 1999 18:17:13 GMT, Dorai Sitaram <····@bunny.gte.com> wrote:
>In article <··············@schilling.cys.de>,
>Klaus Schilling  <·····@schilling.cys.de> wrote:
>>······@2xtreme.net (Christopher R. Barry) writes:
>>> 
>>> Well, ANSI Common Lisp is poor for general string processing due to
>>> its lack of any regular expression facility. It has very powerful
>>> formatted output capabilities, however.
>>
>>Is it possible (even when depending on the Implementation) to do
>>ffi calls to some regex library?
>
>Another naive question: All talk about regexps in CL
>seems to center on how to access a foreign library and
>the difficulties that that causes (Henry Spencer's
>style, CBIND problems, etc.).  Is this because it would
>be prohibitively inefficient to write a regexp library
>entirely within CL?  Has this even been attempted?

You're left with some choices:

a) Write it portably entirely within CL, which I suspect might be
difficult to make efficient.

Not because it's intrinsically hard, but simply because a portable
*and* efficient implementation represents some complex code.

b) Create a non-portable-but-highly-efficient version that is
tightly-integrated with your favorite CL.  This involves convincing
the maker of that CL implementation to add it in.

--> Probably not too hard with CLISP  :=)
--> Probably quite challenging with CMULISP (I gather that
    CMULISP-hacking has a "long learning curve")
--> Economically challenging to convince a commercial CL maker to add
    this in.

Arguably the best answer is...

c) Convince the ANSI/ISO committee to add regexp handling functions to
the CL specification.

That allows option b) to become a "portable-and-efficient" version.

It's easy to convince them to add new functions, right?  :-)
-- 
Is the surface of a planet the right place for an expanding
technological civilization?
········@ntlug.org- <http://www.hex.net/~cbbrowne/langlisp.html>
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3670pvr1n.fsf@lostwithiel.tfeb.org>
* Christopher Browne wrote:

> c) Convince the ANSI/ISO committee to add regexp handling functions to
> the CL specification.

> That allows option b) to become a "portable-and-efficient" version.

> It's easy to convince them to add new functions, right?  :-)

It should be easy to get them to standardise functionality which
already exists and is clearly common-but-slightly-varying across
implementations.  It should be very hard to get them to do design
work.

Even then, I for one would probably oppose a straight string-based
library which is just an obvious interface to one of the standard C
regexp systems -- I think CL can do better than that (see article
earlier in this thread).

--tim
From: Hannu Koivisto
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87hfk9esu3.fsf@senstation.vvf.fi>
····@bunny.gte.com (Dorai Sitaram) writes:

[Replying to several people at the same time, sorry about possible
confusion.]

| Another naive question: All talk about regexps in CL
| seems to center on how to access a foreign library and
| the difficulties that that causes (Henry Spencer's
| style, CBIND problems, etc.).  Is this because it would
| be prohibitively inefficient to write a regexp library
| entirely within CL?  

I don't think so (I think it's mainly the amount of work needed).
As a matter of fact, I believe a regexp library in CL can be faster
(with the same or less work) than one in C.  This may not
be attainable with current compilers, but I don't much care
about them anyway, so this doesn't stop me from writing one.

| Has this even been attempted?

There are some small regexp engines in the context of some lexers
and even some separate ones, but nothing I would use.

Tim Bradshaw <···@tfeb.org> writes:

| * Gareth McCaughan wrote:
| 
| > I think it's because (1) there are already some good
| > regexp libraries written in C, 

Well, "good" is quite subjective.  First of all they are all
string-based, which, like Tim, I consider horrible.

| > (2) most Lisp systems
| > have some way of talking to libraries written in C,

Which completely prohibits many optimizations.

| > and (3) writing a really good regexp library is a
| > difficult and time-consuming job.

Tell me about it :)  I've spent months on it so far, even though I have
already-written code as a model (to which my implementation is
sometimes quite close).  I'm writing it in a quite pedantic
fashion, though.  For example, I've written the parser five times
from scratch already :)  Also, I write it in my spare time, which
doesn't amount to many hours per week.

| However, it's a significant win if you can do it.  I would *much*
| rather talk about some nice structure-based regexp system than the
| horrible string-based stuff one actually gets.  Think how cool regular
| expressions would be if you could talk about them as list structures,
| say.  Lisp could really win here.

I certainly believe it will.

| I think that some of the scsh people -- perhaps Olin Shivers -- have

Exactly Olin Shivers.

| done some work along these lines.  I wonder if there's an
| implementation which could be CL-ified?

Well, Shivers' implementation uses Spencer's POSIX engine, so just
CLifying it is not enough.  Besides, one can do more than just
CLify it [the parts written in Scheme] (of course, depending on
what you mean by CLifying).  Anyway, this is exactly what I'm
doing.

-- 
Hannu
From: Christopher R. Barry
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87d7ux1xnk.fsf@2xtreme.net>
Hannu Koivisto <·····@iki.fi.ns> writes:

> I don't think so (I think it's mainly the amount of work needed).
> As a matter of fact, I believe a regexp library in CL can be faster
> (with the same or less work) than one in C.  This may not
> be attainable with current compilers, but I don't much care
> about them anyway, so this doesn't stop me from writing one.

I don't think a _portable_ regex library written in ANSI Common Lisp
could be faster than one written in _portable (ANSI) C. This is just
one of those things C is good for and Lisp probably isn't.

Ever benchmarked GNU egrep? It has got to be the FASTEST
implementation on the planet. Waaaaaaaaaaaay faster than Perl. (If
anyone has experience with an implementation or library faster than
GNU egrep, please share with us.) GNU egrep can do moderately
complex matches through many megabytes of data in the snap of a
finger.

I'm curious to see how R. Matthew Emerson's gzip code performs. This
is also something that GNU gzip is probably going to be _a lot_ faster
at.

It would be nice if ANSI CL had locatives and some other things... if
there was a way to do them right and still not take too much
flexibility from the vendors. (There probably isn't.)

Christopher
From: Samir Barjoud
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <wk7ll52u60.fsf@mindspring.com>
······@2xtreme.net (Christopher R. Barry) writes:
> 
> I don't think a _portable_ regex library written in ANSI Common Lisp
> could be faster than one written in _portable (ANSI) C. This is just
> one of those things C is good for and Lisp probably isn't.
> 

I wouldn't be surprised if a regex library written in Common Lisp
greatly outperformed one written in C.  After all, the fastest C regex
implementations compile the regex down into some form of byte code and
then interpret that code when matching strings.  The Common Lisp
version could compile the regex into the native _machine code_ that
does the matching.  

The regular expression "a*" (any number of a's) might result in the
following x86 code.

;; ESI points to current loc within string being matched

any_number_of_a:
        mov al,[esi]           ;fetch a character
        inc esi                ;point to next character
        cmp al,'a'             ;is the fetched character an 'a'?
        je any_number_of_a     ;if so keep matching
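
In Lisp the analogous move is handing a generated lambda to COMPILE.  A
minimal sketch (it only handles the a* part, and the names are made up
for illustration):

(defun compile-star-matcher (char)
  ;; Return a compiled function that skips over any number of CHARs
  ;; starting at POS and returns the new position -- the Lisp analogue
  ;; of the hand-written loop above.
  (compile nil
           `(lambda (string pos)
              (declare (simple-string string) (fixnum pos)
                       (optimize (speed 3) (safety 0)))
              (loop while (and (< pos (length string))
                               (char= (schar string pos) ,char))
                    do (incf pos))
              pos)))

;; (funcall (compile-star-matcher #\a) "aaab" 0) => 3

With the type declarations in place, a native-code Lisp compiler has a
fair shot at an inner loop much like the one above.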

-- 
Samir Barjoud
·····@mindspring.com
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3vh8oyhi9.fsf@lostwithiel.tfeb.org>
* Samir Barjoud wrote:

> The regular expression "a*" (any number of a's) might result in the
> following x86 code.

> ;; ESI points to current loc within string being matched

> any_number_of_a:
>         mov al,[esi]           ;fetch a character
>         inc esi                ;point to next character
>         cmp al,'a'             ;is the fetched character an 'a'?
>         je any_number_of_a     ;if so keep matching

I think the obvious implementation -- compiling a regexp to a bit of
Lisp code which in turn was compiled -- would be lucky to get code
this good; however, you should still be able to get significantly fast
code for compiled regexps -- I think you'd quite likely beat any C
implementation which didn't compile regexps to bits of C code.

Back to the topic a bit: the Scheme regexp package by Olin Shivers
(which appears, on superficial examination, to have some allowances in
its syntax for CL) lives at http://www.ai.mit.edu/~shivers/sre.txt I
think.

--tim
From: Hannu Koivisto
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <873dvsel6o.fsf@senstation.vvf.fi>
Tim Bradshaw <···@tfeb.org> writes:

| (which appears, on superficial examination, to have some allowances in
| its syntax for CL),

Yup.

| lives at http://www.ai.mit.edu/~shivers/sre.txt I think.

It lives at <URL:ftp://lambda.ai.mit.edu/pub/shivers/sre.tgz>.
That text file is its "spec/tutorial/design-rationale document"
like the page says.  Just in case someone else intends to fiddle
with it, Shivers informed me that the next scsh release contains
a version of SRE with bugfixes.

-- 
Hannu
(loop (luuk sorrow :loud t))
From: Graham Hughes
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87670ox9v1.fsf@oak.treepeople.pedantic.org>
>>>>> "Tim" == Tim Bradshaw <···@tfeb.org> writes:

    Tim> I think the obvious implementation -- compiling a regexp to a
    Tim> bit of Lisp code which in turn was compiled -- would be lucky
    Tim> to get code this good, however you should still be able to get
    Tim> significantly fast code for compiled regexps -- I think you'd
    Tim> quite likely beat any C implementation which didn't compile
    Tim> regexps to bits of C code.

See, this is the beautiful thing.  *No* C regexp implementation compiles
the regexps to bits of C code, save compiler tools like lex.  It's
theoretically possible, of course, but no one is willing to embed gcc
into their C program for this, not even the Perl people.  Something like

(defun compile-regexp (regexp)
  (compile nil `(lambda (string) ,@(regexp-to-code (optimize regexp)))))

(or an equivalent macro) is *much* more tractable in Lisp.  After all,
optimize could just be #'identity for starters!  Then `all' you'd have
to do is translate the nice sexp for the regular expression into Lisp
code.
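
For what it's worth, a toy REGEXP-TO-CODE that plugs into the sketch
above (skipping the optimize pass) could look like the following.  The
regexp representation (a list of literal strings and (* <char>) forms)
is invented here, and there is no backtracking, so a star must not
overlap whatever follows it:

(defun regexp-to-code (regexp)
  ;; Return a list of forms (for the ,@ above) that match STRING -- the
  ;; generated lambda's argument -- in its entirety.
  `((let ((pos 0) (len (length string)))
      (block match
        ,@(loop for element in regexp
                collect
                (etypecase element
                  (string
                   `(if (and (<= (+ pos ,(length element)) len)
                             (string= string ,element :start1 pos
                                      :end1 (+ pos ,(length element))))
                        (incf pos ,(length element))
                        (return-from match nil)))
                  (cons              ; (* <char>): zero or more CHARs
                   `(loop while (and (< pos len)
                                     (char= (char string pos)
                                            ,(second element)))
                          do (incf pos)))))
        (= pos len)))))

;; (funcall (compile nil `(lambda (string)
;;                          ,@(regexp-to-code '("fo" (* #\o) "bar"))))
;;          "fooooobar")
;; => T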

(NB: it is possible to have compiled C code for the regexp without
having to invoke gcc at runtime, and while remaining semi-platform
independent; Ian Piumarta's techniques would probably help.  But the C
optimizer will never see the whole regexp and thus doesn't stand a
chance of getting the hyperoptimized x86 code that was suggested.)
-- 
Graham Hughes <······@ccs.ucsb.edu>
GPG Fingerprint: 4FC5 80F0 63EB 00BE F438  E365 084B 4010 60BF 17D3
((lambda (x) (list x (list 'quote x)))
 '(lambda (x) (list x (list 'quote x))))
From: R. Matthew Emerson
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87emf8t4lg.fsf@nightfly.apk.net>
[sorry if you see this twice;  I made an idiotic typo in the cancelled one]

······@2xtreme.net (Christopher R. Barry) writes:

> I'm curious to see how R. Matthew Emerson's gzip code performs. This
> is also something that GNU gzip is probably going to be _a lot_ faster
> at.

My current code takes about 20 times as long as gzip does to
uncompress my sample file (5.5 MB compressed, 15MB uncompressed). The
GNU gzip program takes just under 2 seconds of real time; my code
takes about 37 seconds of real time.  (this is with a Pentium II 400
running FreeBSD 3.3-STABLE).

There's a small overhead to doing I/O through CLOS streams, but since
under 10% of the code's time is spent doing I/O, that overhead isn't
much of an issue at this point.

(As a side note, some measurements I've made suggest that
byte-at-a-time I/O with read-byte on CLOS streams in my CL
implementation is roughly 4--6 times more expensive than using
getchar() from C.  If you care, the code I used to do the measurements
is at http://www.thoughtstuff.com/rme/code/ )

If I get motivated, I may experiment by trying to translate the C code
from gzip into Lisp, and see how successful an attempt to stay down at
that bare metal level would be.

-matt

-- 
R. Matthew Emerson
http://www.thoughtstuff.com/rme/
From: Raymond Toy
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <4n7ll3dl48.fsf@rtp.ericsson.se>
>>>>> "Dorai" == Dorai Sitaram <····@bunny.gte.com> writes:

    Dorai> In article <··············@schilling.cys.de>,
    Dorai> Klaus Schilling  <·····@schilling.cys.de> wrote:
    >> ······@2xtreme.net (Christopher R. Barry) writes:
    >>> 
    >>> Well, ANSI Common Lisp is poor for general string processing due to
    >>> its lack of any regular expression facility. It has very powerful
    >>> formatted output capabilities, however.
    >> 
    >> Is it possible (even when depending on the Implementation) to do
    >> ffi calls to some regex library?

    Dorai> Another naive question: All talk about regexps in CL
    Dorai> seems to center on how to access a foreign library and
    Dorai> the difficulties that that causes (Henry Spencer's
    Dorai> style, CBIND problems, etc.).  Is this because it would
    Dorai> be prohibitively inefficient to write a regexp library
    Dorai> entirely within CL?  Has this even been attempted?

Yes, it's been done.  There's the nregexp regexp library in Lisp in
the CMU Lisp archives, and I have another written by someone else
whose name escapes me right now.  I think this latter version is much
better and more complete than the one in the CMU Lisp archives.

Both packages use the typical string notation for regexps.

So I don't understand the need to use a C library for regexps.
Unless, perhaps, the Lisp version is much slower than the C library.
I've not benchmarked these, but I do know that the nregexp utility
compiles the regexp so if you have a slow compiler it will run slowly
the first time.  This was a problem with GCL, which calls out to the C
compiler to compile Lisp code.  I don't use GCL anymore.

Ray
From: Reini Urban
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37f8cd7b.17643860@judy>
Klaus Schilling wrote:
>Is it possible (even when depending on the Implementation) to do
>ffi calls to some regex library?

<and some more string-only? and style discussion>

I wanted to avoid Spencer's regex lib for my binding (style), and found
PCRE better:
"Perl-compatible regular expressions", by Philip Hazel <····@cam.ac.uk>,
currently v2.06. The license is more liberal than the GPL.

It is readable, and you can even get the list of matched substrings and
do almost Perl-like search/replace.
It supports Perl 5.005 syntax for the m// operator, which also means
minimal (non-greedy) matches, not only maximal ones.
But I'm only doing the Win32 DLL and the AutoLISP binding.
It should be completed by the end of this month, hopefully.

Other Lisp bindings to PCRE (e.g. to ACL, Corman Lisp, LW, CLISP, ...)
should be straightforward, at least much easier than calling Perl or the
old Spencer regex or the improved GNU regex.
--
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3yado6i6g.fsf@lostwithiel.tfeb.org>
* Christopher R Barry wrote:

> Well, ANSI Common Lisp is poor for general string processing due to
> its lack of any regular expression facility. It has very powerful
> formatted output capabilities, however.

However, it is no poorer than C++, which the original poster was
proposing to use, except perhaps insofar as C++ has a reasonably
standard interface to programs written in C.

--tim
From: Duane Rettig
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <4hfk7hwuq.fsf@beta.franz.com>
······@2xtreme.net (Christopher R. Barry) writes:

> Marc Cavazza <·········@bradford.ac.uk> writes:
> 
> > Now, getting back to parsing other things than NL, there has been a case
> > made that Lisp was "poor" at string processing (generally by those advocating
> > Lisp/C integration), but that's all I recall
> 
> Well, ANSI Common Lisp is poor for general string processing due to
> its lack of any regular expression facility. It has very powerful
> formatted output capabilities, however.
> 
> You can of course make small Perl or Emacs tools to remedy the regular
> expression deficiency. (Allegro CL also has a regex package, but IMHO
> there is room for a bit of improvement and I'd use Perl or egrep
> instead.)

All the talk of regexps and structure parsing/editing prompted me to go
back to this earlier post.  I'd be interested to know: what is it that
you consider to be that "room for improvement" in Allegro CL's regexp
package?

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Marco Antoniotti
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <lw1zba8ald.fsf@copernico.parades.rm.cnr.it>
Duane Rettig <·····@franz.com> writes:

> All the talk of regexps and structure parsing/editing prompted me to go
> back to this earlier post.  I'd be interested to know: what is it that
> you consider to be that "room for improvement" in Allegro CL's regexp
> package?

Making it available for CLisp and CMUCL? :)

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - 06 68 10 03 17, fax. +39 - 06 68 80 79 26
http://www.parades.rm.cnr.it/~marcoxa
From: Christopher R. Barry
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87zoxxxrbf.fsf@2xtreme.net>
Marco Antoniotti <·······@copernico.parades.rm.cnr.it> writes:

> Duane Rettig <·····@franz.com> writes:
> 
> > All the talk of regexps and structure parsing/editing prompted me to go
> > back to this earlier post.  I'd be interested to know: what is it that
> > you consider to be that "room for improvement" in Allegro CL's regexp
> > package?

[My news feed has been pretty flaky the past few days. I've waited a
few hours to see this post from Duane but it hasn't shown up yet and
I've got to get going now so hopefully no important text was elided
from Marco's reply.]


My work with regular expressions is (in an indirect way) commercially
related, so until such time that I get a "real" job (I'm 20 years old
as of today--so hopefully that will be Real Soon Now) where licensing
Allegro CL would be an option, I probably will not stress the facility
for writing "real" (non-trivial) text-processing applications. The
experience of having written such an application with Allegro CL would
add significant insight to my commentary, but nevertheless I'll
try to capture the essence of what I feel its limitations and issues
to be, based on comparing my experience with popular tools for the
task against my experience using Allegro CL and reading its
accompanying documentation.

 * Better Allegro CL product integration and more functions

   Perl provides a large number of core operators and functions that
   are regexp-aware. Emacs has a fair number of regexp-aware
   functions. The entire Allegro CL Regular Expression API introduces
   only two functions:
  
     excl:compile-regexp
     excl:match-regexp
  
   and no existing functions have had regexp-support added.
  
   + Example of regexp integration:
  
     With Emacs you can use regular expressions with apropos to allow
     high-precision matching of symbol-names.
  
   + Some useful functions every implementation needs:
  
     Splitting - a way to split a string by a pattern. Emacs has
     `split-string', for example. Perl has a powerful `split'.
  
     Substitution - something analogous to Perl's `s' operator.

 * More match characters, more precision features, and better interoperability

   Things that work with most other regexp flavors do not work with the
   Allegro one.

   Example of this: I use gnus for reading mail and news and I use
   regexps to split incoming mail into different folders. A sample
   snippet of my ~/.gnus file is:

       ...
       ("cmucl"      "\\(^To\\|^cc\\):.*\\(cmucl\\|ilisp\\)")
       ("cl-http"    "\\(^To\\|^cc\\):.*www-cl")
       ...

   So the first pattern is obviously going to match "To: ·········@cons.org",
   for example. Here is how to do the same pattern for Perl and for egrep:

       (^To|^cc):.*(cmucl|ilisp)

   The only difference is that "(" and "|" need not be escaped.
   However, with Allegro CL:

     USER(1): (match-regexp "\\(^To\\|^cc\\):.*\\(cmucl\\|ilisp\\)"
			    "To: ·········@cons.org")
    => NIL

   The answer to this behavior appears to be (from the doc):

     ^   If this is the first character of the regular expression
	 string it forces the match to start at the beginning of
	 the to-be-matched string. If this character appears after
	 the beginning of the string it stands for itself.

   It says "regular expression _string_", not just "regular
   expression". In my above example "^To" and "^cc" are each regular
   expressions that are building-blocks of a larger regular
   expression. Other tools let you use "^" to match beginning-of-line
   for a regular expression located anywhere in the string, not just
   one that is at the beginning of the string.

   Fortunately, there is an easy (if not superior) workaround for this
   specific case:

     USER(2): (match-regexp "^\\(To\\|cc\\):.*\\(cmucl\\|ilisp\\)"
		            "To: ·········@cons.org")
    => T
    => "To: cmucl"
    => "To"
    => "cmucl"

   The gnus docs use the "\\(^To\\|^cc\\)" style for some reason.
   (Which is why I adopted it without thinking about it.) This is a
   significant incompatibility, however.

   The doc also indicates that a few characters behave differently
   depending on what OS you are running on.

   There are many other significant limitations (which cause
   incompatibility with other tools). This reply is getting real long
   real fast, so I don't have time to go over all of them today
   (another time is a possibility), but here's a particularly bad one
   (from doc):

     The + and * characters must follow a single character regular
     expression. They cannot follow a group expression, even if that
     group matches just one character. In other words \(a\)* is not
     legal. [a]* is legal since the [..] expression always matches one
     character.

   The other tools I've mentioned let you do this. The doc even says:

     The GNU Emacs regular expression parser supports a lot of
     additional features.

   XEmacs in particular supports a lot of features. And nothing seems
   to match Perl currently. "man perlre" will get you the Perl regexp
   doc if you have the Perl distribution and its documentation
   distribution installed and "man perlop" will get you all of the rich
   regexp-aware operators.

If some of these issues were addressed then there would be a good bit
less "room for improvement". But to really go above-and-beyond
expectations and provide a Lispy-implementation there are the issues
of...

   * Extensibility

     Ability to define your own match types and have extensibility/control
     analogous to what the Lisp reader provides for processing Lisp
     data. By rebinding a few variables or doing an appropriate
     incantation, it would be cool if you could read in regular
     expressions compatible with POSIX standards, or GNU grep/egrep,
     or Perl. This would ease migration of customer tools written
     using these things to Allegro CL.

  * Non-string representation of regular expression objects

    This whole "regexps as s-exprs" thing on c.l.l lately is of
    interest. Just like you can build pathname-namestrings using Lisp
    objects instead of string concatenation, it would be cool if you
    could build regular expression objects using Allegro CL in a
    likewise fashion. You could represent regular expressions in this
    abstract way, and then get string representations of your regular
    expression objects that are compatible with different flavors
    based on environment settings. (Sort of analogous to READ being
    able to interpret numbers differently based on *read-base*.)

  * Additional Allegro CL integration:

    I can imagine very cool stuff could be done with pathnames and
    their operations by supporting regular expressions with them.

    In general, a fair number of CL functions could be made to support
    regexps in a standard compatible way.

Do you guys ever survey your customers? It might be good to ask them
from time to time if they ever use Perl or other languages as
components of a project they are primarily doing with Allegro CL, and
if so, then inquire as to what functionality absent from Allegro CL
was present in these other tools. For example, some people on this
group (like Tim IIRC) have mentioned using Perl to preprocess text
data to a "Lisp friendly format". If a customer has done this, then
was it because:

  1. Perl built-in core features

     The customer feels that Perl has a rich set of regexp-aware
     operators and functions, and supports more special match
     characters and precision matching features than other
     implementations.

  2. Perl 3rd-party/CPAN libraries

     The customer wished to make use of the large number of
     pre-written, free libraries out there for parsing every kind
     of data under the sun.

  3. Ignorance

     The customer was too lazy to read the Allegro CL
     documentation overview, and thus didn't even realize that
     there was regexp support in Allegro CL.

Doing something about the 2 variety is probably outside the resources
of Franz (Sun managed the resources to do a 50% job with Java), and
I'm not sure I'd even want customers of the 3 variety (at least not
want to listen to them!), but reasons of the 1 variety are IMO worth
investigation.

I think that's all the time I've got for now. (It's my birthday; I've
gotta _go outside_ today before it's dark!) If I spend more time
thinking about this and/or use the Allegro CL regexp package a bit
more, I'm sure there is a lot of other stuff I could write about.

I very much appreciate your asking me (a non-supported customer) why I
felt there was "room for improvement" with the regular expression
facility provided with Allegro CL.

Christopher
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3g0zpxkyj.fsf@lostwithiel.tfeb.org>
* Christopher R Barry wrote:
> Do you guys ever survey your customers? It might be good to ask them
> from time to time if they ever use Perl or other languages as
> components of a project they are primarily doing with Allegro CL, and
> if so, then inquire as to what functionality absent from Allegro CL
> was present in these other tools. For example, some people on this
> group (like Tim IIRC) have mentioned using Perl to preprocess text
> data to a "Lisp friendly format". If a customer has done this, then
> was it because:

[Various reasons elided].

None of the above. I do it like that because Perl (really, awk & sed)
is *good* at that stuff.  It's very unlikely I would do it from within
Lisp even if there was much better regexp support.  That kind of
front-end filtering is what Unix tools are for, and they do it just
fine.  In the same way, I don't do source control with Lisp programs,
and I don't want to, because I have CVS and it does it just fine.

(s/Lisp/C(++)/ above if you like).

What I do in Lisp is the non-trivial (and therefore interesting)
programming stuff, the stuff that if you do it in Perl causes you to
go rapidly insane, and either decide that Perl is the best language in
the world, or get to hate it so much you try and reimplement what it
does in Lisp, which is clearly a forlorn hope, scsh notwithstanding.

I'm really not sure that regexps are that interesting in Lisp: most of
the matching tasks you want to do are sufficiently idiosyncratic that
regexps-as-normally-thought-of are not up to it (unless your brain has
been rotted by Perl).  `Structural regexps' (like SRE) are slightly
more interesting because you can do compositional stuff with them
which is hard with stringy regexps.

--tim
From: Duane Rettig
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <4btacoocu.fsf@beta.franz.com>
······@2xtreme.net (Christopher R. Barry) writes:

> My work with regular expressions is (in an indirect way) commercially
> related, so until such time that I get a "real" job (I'm 20 years old
> as of today--

Happy birthday.

> so hopefully that will be Real Soon Now) where licensing
> Allegro CL would be an option, I probably will not stress the facility
> for writing "real" (non-trivial) text-processing applications. The
> experience of having written such an application with Allegro CL would
> add significant insight to my commentary, but nevertheless I'll
> try to capture the essence of what I feel its limitations and issues
> to be, based on comparing my experience with popular tools for the
> task against my experience using Allegro CL and reading its
> accompanying documentation.

I asked the question because a colleague of mine, another developer,
is working on the regexp facility for a specific purpose, and he in
fact has given me his comments to your posting.  BTW, he considered
your posting to be very insightful.

===
Warning! Lisp Purists and Perl Haters, please turn the other way
while I finish this post talking about Perl as well as regexps!
===

I don't hack perl myself, but this other developer does, and he
got tired of waiting for his perl scripts to fire up (he said
that one gets numbed to the fact that it takes a few seconds for
the perl system to initialize itself when it starts).  He also got
tired of not having the power of lisp behind his scripts, and felt
that he could do much better in lisp, so he set out to build a
perl-in-lisp scripting system.  I have seen the results: when
the lisp is paged in, it starts up instantaneously and is faster
than the perl scripts, even when running in a lisp without a
compiler.  He intends to release this module as a patch to be
usable in some way by anyone for a scripting language.  We'll
work out what kind of distributability it might have.

Some of the comments below are the other developer's, changed by
me to third-person, and somewhat editorialized.  Some describe the
work he is doing as part of his perl-in-lisp module.

>  * Better Allegro CL product integration and more functions
> 
>    Perl provides a large number of core operators and functions that
>    are regexp-aware. Emacs has a fair number of regexp-aware
>    functions. The entire Allegro CL Regular Expression API introduces
>    only two functions:
>   
>      excl:compile-regexp
>      excl:match-regexp
>   
>    and no existing functions have had regexp-support added.
>   
>    + Example of regexp integration:
>   
>      With Emacs you can use regular expressions with apropos to allow
>      high-precision matching of symbol-names.

It would be nice to use regexps where they make sense, like in apropos.
There is actually an internal function called excl::apropos-regexp
(an apropos would have found it :-) which is used by the emacs-lisp
interface.  Example:

USER(1): (excl::apropos-regexp "^C[D]*R$")
CDR                 [function] (LIST)
CDDDDR              [function] (LIST)
CDDR                [function] (LIST)
CDDDR               [function] (LIST)
:CDR                value: :CDR
USER(2): 

>    + Some useful functions every implementation needs:
>   
>      Splitting - a way to split a string by a pattern. Emacs has
>      `split-string', for example. Perl has a powerful `split'.

He has added split-regexp.

>      Substitution - something analogous to Perl's `s' operator.

He has added replace-regexp.

>  * More match characters, more precision features, and better interoperability
> 
>    Things that work with most other regexp flavors do not work with the
>    Allegro one.
> 
>    Example of this: I use gnus for reading mail and news and I use
>    regexps to split incoming mail into different folders. A sample
>    snippet of my ~/.gnus file is:
> 
>        ...
>        ("cmucl"      "\\(^To\\|^cc\\):.*\\(cmucl\\|ilisp\\)")
>        ("cl-http"    "\\(^To\\|^cc\\):.*www-cl")
>        ...
> 
>    So the first pattern is obviously going to match "To: ·········@cons.org",
>    for example. Here is how to do the same pattern for Perl and for egrep:
> 
>        (^To|^cc):.*(cmucl|ilisp)
> 
>    The only difference is that "(" and "|" need not be escaped.

He now has a mode where (, ) and | do not need \\ in a string.

>    However, with Allegro CL:
> 
>      USER(1): (match-regexp "\\(^To\\|^cc\\):.*\\(cmucl\\|ilisp\\)"
> 			    "To: ·········@cons.org")
>     => NIL
> 
>    The answer to this behavior appears to be (from the doc):
> 
>      ^   If this is the first character of the regular expression
> 	 string it forces the match to start at the beginning of
> 	 the to-be-matched string. If this character appears after
> 	 the beginning of the string it stands for itself.
> 
>    It says "regular expression _string_", not just "regular
>    expression". In my above example "^To" and "^cc" are each regular
>    expressions that are building-blocks of a larger regular
>    expression. Other tools let you use "^" to match beginning-of-line
>    for a regular expression located anywhere in the string, not just
>    one that is at the beginning of the string.

He plans to add ^ and $ matching for lines in a string.  This isn't
implemented yet, though.

>    Fortunately, there is an easy (if not superior) workaround for this
>    specific case:
> 
>      USER(2): (match-regexp "^\\(To\\|cc\\):.*\\(cmucl\\|ilisp\\)"
> 		            "To: ·········@cons.org")
>     => T
>     => "To: cmucl"
>     => "To"
>     => "cmucl"
> 
>    The gnus docs use the "\\(^To\\|^cc\\)" style for some reason.
>    (Which is why I adopted it without thinking about it.) This is a
>    significant incompatibility, however.
> 
>    The doc also indicates that a few characters behave differently
>    depending on what OS you are running on.

We're not aware of this.  Which ones?

>    There are many other significant limitations (which cause
>    incompatibility with other tools). This reply is getting real long
>    real fast, so I don't have time to go over all of them today
>    (another time is a possibility), but here's a particularly bad one
>    (from doc):
> 
>      The + and * characters must follow a single character regular
>      expression. They cannot follow a group expression, even if that
>      group matches just one character. In other words \(a\)* is not
>      legal. [a]* is legal since the [..] expression always matches one
>      character.

He's working on allowing +, * and (the new) ? to follow a group.  This
is tough, since he's breaking new ground.  All the previous changes
slipped relatively easily into the current code. 

>    The other tools I've mentioned let you do this. The doc even says:
> 
>      The GNU Emacs regular expression parser supports a lot of
>      additional features.
> 
>    XEmacs in particular supports a lot of features. And nothing seems
>    to match Perl currently. "man perlre" will get you the Perl regexp
>    doc if you have the Perl distribution and its documentation
>    distribution installed and "man perlop" will get you all of the rich
>    regexp-aware operators.
> 
> If some of these issues were addressed then there would be a good bit
> less "room for improvement". But to really go above-and-beyond
> expectations and provide a Lispy-implementation there are the issues
> of...
> 
>    * Extensibility
> 
>      Ability to define your own match types and have extensibility/control
>      analogous to what the Lisp reader provides for processing Lisp
>      data. By rebinding a few variables or doing an appropriate
>      incantation, it would be cool if you could read in regular
>      expressions compatible with POSIX standards, or GNU grep/egrep,
>      or Perl. This would ease migration of customer tools written
>      using these things to Allegro CL.

This is what he's done.  There is a special controlling "perl mode"
for regexps.

>   * Non-string representation of regular expression objects
> 
>     This whole "regexps as s-exprs" thing on c.l.l lately is of
>     interest. Just like you can build pathname-namestrings using Lisp
>     objects instead of string concatenation, it would be cool if you
>     could build regular expression objects using Allegro CL in a
>     likewise fashion. You could represent regular expressions in this
>     abstract way, and then get string representations of your regular
>     expression objects that are compatible with different flavors
>     based on environment settings. (Sort of analogous to READ being
>     able to interpret numbers differently based on *read-base*.)

The motivation for this is hard to see.  It would be a fair amount
of work (much in the design, probably).  Speaking for myself, I
would prefer not to cram the concept of structure manipulation into
an essentially string-handling tool.  There are structure editors
out there; FranzLisp had a fairly powerful one, and CL has an
inspector (though not specified and thus implementation dependent).
Perhaps non-string manipulations would better be addressed by
enhancing structure-examination facilities instead.
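
For what it's worth, the kind of thing being asked for would presumably
look something like the following.  This is a hypothetical sketch only;
none of these functions exist in Allegro CL, and the parse-tree syntax
is invented for illustration:

    ;; A regexp written as an s-expression instead of a string.
    (defparameter *to-or-cc*
      '(:sequence (:group (:or "To" "cc")) ":" (:star :any)))

    ;; A toy "renderer" that turns the tree into a string for one
    ;; particular regexp dialect.
    (defun render-regexp (form)
      (etypecase form
        (string form)
        (keyword (ecase form
                   (:any ".")))
        (cons (ecase (first form)
                (:sequence (format nil "~{~A~}"
                                   (mapcar #'render-regexp (rest form))))
                (:group (format nil "\\(~A\\)"
                                (render-regexp (second form))))
                (:or (format nil "~{~A~^\\|~}"
                             (mapcar #'render-regexp (rest form))))
                (:star (format nil "~A*"
                               (render-regexp (second form))))))))

    ;; (render-regexp *to-or-cc*)  =>  "\\(To\\|cc\\):.*"

One could imagine several renderers, one per dialect, selected by an
environment setting; but the design questions remain as stated above.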

>   * Additional Allegro CL integration:
> 
>     I can imagine very cool stuff could be done with pathnames and
>     their operations by supporting regular expressions with them.

We already support globbing, which is a little regexp-like.  But he
agrees.  On the other hand, overloading pathnames with more gunk might be
unpopular.
 
>     In general, a fair number of CL functions could be made to support
>     regexps in a standard compatible way.
> 
> Do you guys ever survey your customers? It might be good to ask them
> from time to time if they ever use Perl or other languages as
> components of a project they are primarily doing with Allegro CL, and
> if so, then inquire as to what functionality absent from Allegro CL
> was present in these other tools. For example, some people on this
> group (like Tim IIRC) have mentioned using Perl to preprocess text
> data to a "Lisp friendly format". If a customer has done this, then
> was it because:
> 
>   1. Perl built-in core features
> 
>      The customer feels that Perl has a rich set of regexp-aware
>      operators and functions, and supports more special match
>      characters and precision matching features than other
>      implementations.

It is his goal to satisfy (1).

>   2. Perl 3rd-party/CPAN libraries
> 
>      The customer wished to make use of the large number of
>      pre-written, free libraries out there for parsing every kind
>      of data under the sun.

He's talked with people and thought about this lately.  Perl is so
successful precisely because of (2).  The reasons it is able to win in
this area:

a. it has a strong hierarchical module system (he's thinking about
   hierarchical packages a la the Lisp machine).  This is important
   because it allows people to write add-ons that do not collide with one
   another.
b. perl is free.
c. perl is economical (you can write useful code that is small, even
   if it is hard to understand).
d. perl has a rich set of primitives (system calls and regexp), enough
   to build useful stuff.

He plans to solve (c) and (d) with his new shell module and regexp
changes.  For (b), we are looking into maybe providing a lisp without
a compiler.  He believes that most uses of the perl-in-lisp code won't
suffer from lack of a compiler.  (a) we could add to acl, but it might
have other ramifications that could make it a poor decision.
 
>   3. Ignorance
> 
>      The customer was too lazy to read the Allegro CL
>      documentation overview, and thus didn't even realize that
>      there was regexp support in Allegro CL.

Yes, we need to do better in laying out our docs.  We're working on
that on a continuous basis.

> Doing something about the 2 variety is probably outside the resources
> of Franz (Sun managed the resources to do a 50% job with Java), and
> I'm not sure I'd even want customers of the 3 variety (at least not
> want to listen to them!), but reasons of the 1 variety are IMO worth
> investigation.

It is my belief that my colleague considers all three issues important,
since he is a perl programmer himself (reluctantly) and is moving to
provide himself with the power to switch his scripts to lisp.

> I think that's all the time I've got for now. (It's my birthday; I've
> gotta _go outside_ today before it's dark!) If I spend more time
> thinking about this and/or use the Allegro CL regexp package a bit
> more, I'm sure there is a lot of other stuff I could write about.
> 
> I very much appreciate your asking me (a non-supported customer) why I
> felt there was "room for improvement" with the regular expression
> facility provided with Allegro CL.

You provided good feedback.  Thank you.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Christopher R. Barry
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <87puysxjhi.fsf@2xtreme.net>
Duane Rettig <·····@franz.com> writes:

> >    The doc also indicates that a few characters behave differently
> >    depending on what OS you are running on.
> 
> We'e not aware of this.  Which ones?

I misread the "Compatibility with other regular expression parsers"
paragraph and the subsequent one. Sorry about that.

Christopher
From: Fernando Mato Mira
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F49C53.AD989E8@iname.com>
Marc Cavazza wrote:

> Mitchell Morris wrote:
>
> > I've now read from several sources that parsing isn't in the spirit of Lisp.
>
> I'm not sure to understand this (probably because I develop NL parsers
> in Lisp :-)

Easy. Because parsing anything other than NL is boring, bureaucratic, a waste of
time/life/money and makes no sense.
[parsing s-exprs is bureaucratic, makes A LOT of sense and is so trivial there's
not enough time to get bored ;-)]

The right phrasing would be "Parsing SYNTHETIC languages is not in the spirit of
LispERS"
From: ···········@alcoa.com
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <7t069d$agh$1@nnrp1.deja.com>
In article <··················@unpkhswm04.bscc.bls.com>,
  ········@morrisland.com wrote:
> I've now read from several sources that parsing isn't in the spirit of
> Lisp. I must admit this confuses me, in that I also read Lisp advocacy
> that it shouldn't be necessary to use other inferior languages (like
> C++) to accomplish any significant task.
>
> So tell me ... what's a boy to do? Write some not-Lisp-y Lisp to parse
> the data? Write a C++ preprocessor and insert it into the workflow
> ahead of the Lisp process?

I wasn't aware that C++ was renowned as a parsing language. Perhaps you
can show us some eye-popping parsing code written in C++ and we'll
demonstrate its inferiorities. Or perhaps you meant Perl?

--
John Watton
Alcoa Inc.


From: Mitchell Morris
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <slrn7v9a8a.4ku.mgm@unpkhswm04.bscc.bls.com>
In article <············@nnrp1.deja.com>, ···········@alcoa.com wrote:
[snip]
>
>I wasn't aware that C++ was renowned as a parsing language. Perhaps you
>can show us some eye-popping parsing code written in C++ and we'll
>demonstrate its inferiorities. Or perhaps you meant Perl?
>
>--
>John Watton
>Alcoa Inc.

First, let me say that I appreciate all the replies I've gotten. I had asked
one of the participants in the "reading XML" thread here in comp.lang.lisp
about parsing non-trivial textual inputs. He replied that it wasn't really
Lisp-y in nature, and pointed me to H.Baker's collection of writings,
including the META paper. From this sample of two, plus a general impression
from the newsgroup, I inferred that parsing wasn't really Lisp-y. It appears
that I was mistaken.



To answer your question, if I have to parse a non-trivial textual input in my
real job (writing C++ for BellSouth), I'd be using "lex" and "yacc" (with
some compile-time rigamarole) where I write stuff like:

    recordlist : record { $$ = new vector<record_t>; $$->push_back(*$1); }
         | recordlist record { $1->push_back(*$2); $$ = $1; }
         ;

    record : lx_record lx_name fieldlist {
             $$ = new record_t; $$->name = $2; $$->fields = $3;
         }
         ;

and the tools generate lots and lots and lots of really ugly C (which is also
valid C++) code that I don't have to care about. Note that I do see how much
like META this is ... which seems reasonable since they both started from the
same place. Baker, however, says that he isn't comfortable with this
technique in that it isn't very Lisp-y in nature.

I'd also do about the same thing in Perl, except I'd use "Parse::Yapp" or
"Parse::RecDescent" instead of "lex" and "yacc".



I'm not actually asking how to translate this into Lisp, but how to think
in Lisp about solving the same sorts of problems. As an aside, I'd also like
to know how to research the tools available for any given environment. I've
grown quite fond of "CPAN" as a tool-acquisition tool and wonder if there's
anything like it for any other environment. In another reply to this same
thread, I've already been given the name "zebu" and I'm curious if I could
have found it by another mechanism.

+Mitchell

-- 
Mitchell Morris

Just because something is obviously happening doesn't mean something obvious
is happening.
	-- Larry Wall
From: Duane Rettig
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <4btakuy9r.fsf@beta.franz.com>
···@unpkhswm04.bscc.bls.com (Mitchell Morris) writes:

> I've now read from several sources that parsing isn't in the spirit of Lisp.

Maybe I have missed it, but I don't think it has come from this newsgroup.
You may be misinterpreting what is being said.  See below.

> I must admit this confuses me, in that I also read Lisp advocacy that it
> shouldn't be necessary to use other inferior languages (like C++) to
> accomplish any significant task.

It is very easy to write parsing code in lisp.  In fact, one such
parser has already been written for you, namely the lisp function
READ.  Now, what you may be hearing is that it is not good to use
lisp's READ function to parse non-lisp code.  This is as true as
the fact that it is not a good idea to try to use the C++ parser
(assuming you can get your hands on it) to try to parse lisp.  What
you would do instead is to write a C++ _program_ to parse lisp code,
and what you would do in the case of lisp is to write lisp code
to parse C++, if that is what you desire.
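
(To make the READ point concrete, a trivial standard-CL example; the
data here is made up:

    ;; If the input is already in lisp syntax, parsing it is one call:
    (with-input-from-string (s "(customer \"Jones\" (balance 104.50))")
      (read s))
    => (CUSTOMER "Jones" (BALANCE 104.5))

No grammar, no lexer, and you get real lisp objects back.)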

Note two points that might be considered exceptions to the above:

 1. In our free CBIND tool, which parses C/C++ code and outputs
lisp foreign-function specifications in lisp formats, we didn't
bother to write a new parser, because C++ is so hard to parse.
Instead we simply used the gcc front-end and modified it.  This
was a maintenance decision (especially since it is free) so that
we are not bound to keeping up with current trends in C++ parsing
technology.

 2. Although I agree with others that it is not a very good idea
to use Lisp's READ function to parse non-lisp data, it is still
an incredibly versatile function, with character macros and almost
enough programmability to do any kind of parsing you want.  (Perhaps
it is this versatility that leads to the incorrect conclusion that
lisp can parse non-lisp, but not very well).  Anyway, one thing I miss
in CL is the FranzLisp notion of infix character macros (I think they
may have come originally from MacLisp), and with a judicious redefinition
of the internal read-list function, I can get some infix macro capability.
This allows me to port an old English-grammar parser from FranzLisp to
Allegro CL which, among other things, can parse the following as if it
were a sentence:

(John's going to Mary's house.)

Note the use of the apostrophe in spite of the fact that it is also the
lisp "quote" character, and the period at the end.  Note also that other
than the parens around the sentence, it looks very much like a sentence
in English.  Note, however, that this uses a couple of non-CL-conforming
features, which is why it might not really be considered an exception.
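
(For anyone who hasn't played with reader programmability, here is a
tiny sketch using only standard CL; the infix character macros I
mention above go beyond what the standard provides:

    ;; Make #\! a macro character that wraps the following form in NOT.
    ;; (Real code would install this in a copy of the readtable rather
    ;; than in the global *readtable*.)
    (set-macro-character #\!
      (lambda (stream char)
        (declare (ignore char))
        (list 'not (read stream t nil t))))

    (read-from-string "!done")  =>  (NOT DONE)

The English-grammar port uses the same hooks, just far more aggressively.)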

> I can reconcile these two concepts when I'm trying to read data that I
> generate myself, but I can't seem to reconcile them when I'm trying to read
> the data supplied by my client(s). It's perhaps "purer" (for some definition
> of "pure") to claim that the input data should already be in a Lisp-friendly
> format, but it doesn't seem practical to push back on my customers like that
> in every (or even many) cases.

"Pure" here means that if you want to parse lisp, you can use the already-
supplied READ function, but if you want to parse another language, it may
be harder to use READ for that.

> So tell me ... what's a boy to do? Write some not-Lisp-y Lisp to parse the
> data? Write a C++ preprocessor and insert it into the workflow ahead of the
> Lisp process? Quit programming and take up burglary?

You might consider any of the above, depending on your morals (and I am not
necessarily referring to burglary :-).  However, the phrase
"write some not-Lisp-y Lisp" misses the point; you should definitely
write your lisp code in a lispy way; it will simply not use READ and its
functionality will be geared toward parsing C code.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: William Deakin
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F481E0.6A844336@pindar.com>
Duane Rettig wrote:

> ···@unpkhswm04.bscc.bls.com (Mitchell Morris) writes:
> > Quit programming and take up burglary?
>
> You might consider any of the above, depending on your morals (and I am not
> necessarily referring to burglary :-)

But do, if you can make a success of it ;)
Me, I'm just too honest. "It's a fair cop, guv'nor. You got me bang to rights"

Best Regards,

:) will
From: Paolo Amoroso
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37f7d23a.4969739@news.mclink.it>
On 30 Sep 1999 10:04:00 -0700, Duane Rettig <·····@franz.com> wrote:

> ···@unpkhswm04.bscc.bls.com (Mitchell Morris) writes:
> 
> > I've now read from several sources that parsing isn't in the spirit of Lisp.
> 
> Maybe I have missed it, but I don't think it has come from this newsgroup.
> You may be misinterpreting what is being said.  See below.

He probably referred to article "Re: newbie needs a little help parsing a
comma delimited file" (message ID ···············@world.std.com) by Kent
Pitman.


Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://cvs2.cons.org:8000/cmucl/doc/EncyCMUCLopedia/
From: Tim Bradshaw
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <ey3u2ocs5ux.fsf@lostwithiel.tfeb.org>
* Mitchell Morris wrote:

> So tell me ... what's a boy to do? Write some not-Lisp-y Lisp to parse the
> data? Write a C++ preprocessor and insert it into the workflow ahead of the
> Lisp process? Quit programming and take up burglary?

What I do is write perl/awk/whatever scripts that take whatever random
format I get and spit out something lisp-readable.  Some people
probably think this is a bad solution but it works OK for me -- the
perl scripts stay small and comprehensible as they just do one thing,
and the Lisp stuff doesn't need to do any unneeded string hacking.
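
(To give the idea: the scripts emit forms like

    (record "fred" :age 42 :group "admin")

and the Lisp end is then just a read loop, something like this sketch,
with the file name invented:

    (with-open-file (in "converted-data.lisp")
      (loop with eof = (gensym)
            for form = (read in nil eof)
            until (eq form eof)
            collect form))

so all the string-bashing stays on the other side of the fence.)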

--tim
From: Lieven Marchand
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <m3zoy4e2ay.fsf@localhost.localdomain>
···@unpkhswm04.bscc.bls.com (Mitchell Morris) writes:

> So tell me ... what's a boy to do? Write some not-Lisp-y Lisp to parse the
> data? Write a C++ preprocessor and insert it into the workflow ahead of the
> Lisp process? Quit programming and take up burglary?
> 

There are lispy approaches to parsing too: Henry Baker's paper on
META, Vaughn's approach (can't remember the exact name for the
moment). And there are even yacc clones like zebu available.

The Lisp emphasis on infix and read syntax and load forms tends to
help to stay clear from inventing formats that are too difficult to
parse. The cost of such formats is not only or not even predominantly
in the complexity of the parser but in the maintainability of the
content. If it is hard to parse it is also hard to be sure as producer
how it is going to be parsed (cfr. C++ where writing a parser is a
large project and where heaps of even experienced programmers get
caught in all sorts of gotcha's). On the other hand infix hasn't saved
SGML from being difficult to parse.

-- 
Lieven Marchand <···@bewoner.dma.be>
If there are aliens, they play Go. -- Lasker
From: Arthur Lemmens
Subject: Re: Parsing not in the spirit of Lisp?
Date: 
Message-ID: <37F3C6B0.93F6C369@simplex.nl>
Lieven Marchand wrote:
> 
> There are lispy approaches to parsing too: Henry Baker's paper on
> META, Vaughn's approach (can't remember the exact name for the
> moment). And there are even yacc clones like zebu available.

Using 'parser combinators' is also an interesting technique from the 
functional programming community. To quote Graham Hutton and Erik Meijer 
from a paper called "Monadic Parser Combinators":
"In functional programming, a popular approach to building recursive
descent parsers is to model parsers as functions, and to define
higher-order functions (or 'combinators') that implement grammar
constructions such as sequencing, choice and repetition."

I've built a small library of parser combinators in CL. In my experience
parser combinators are a compact, elegant and usable (if not extremely
efficient) technique for many parsing problems. The downside is that you
can't easily implement left-recursive grammars; IIRC, this can also be a 
problem with Henry Baker's neat META hacks. 
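
To give the flavour, here's a toy version of the idea; it is much
simpler than anything you'd actually use, and the names are made up:

    ;; A parser is a function from a list of tokens to
    ;; (result . remaining-tokens) on success, or NIL on failure.
    (defun p-token (pred)
      "Succeed on a single token satisfying PRED."
      (lambda (toks)
        (when (and toks (funcall pred (first toks)))
          (cons (first toks) (rest toks)))))

    (defun p-seq (p q f)
      "Run P then Q; combine the two results with F."
      (lambda (toks)
        (let ((r1 (funcall p toks)))
          (when r1
            (let ((r2 (funcall q (cdr r1))))
              (when r2
                (cons (funcall f (car r1) (car r2)) (cdr r2))))))))

    (defun p-or (p q)
      "Try P; if it fails, try Q on the same input."
      (lambda (toks)
        (or (funcall p toks) (funcall q toks))))

    ;; (funcall (p-seq (p-token #'numberp) (p-token #'numberp) #'+)
    ;;          '(1 2 stop))
    ;; => (3 STOP)

Sequencing, choice and repetition then fall out as ordinary
higher-order functions, which is what makes the technique so compact.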

> The Lisp emphasis on infix and read syntax and load forms tends to
> help to stay clear from inventing formats that are too difficult to
> parse. The cost of such formats is not only or not even predominantly
> in the complexity of the parser but in the maintainability of the
> content. If it is hard to parse it is also hard to be sure as producer
> how it is going to be parsed (cfr. C++ where writing a parser is a
> large project and where heaps of even experienced programmers get
> caught in all sorts of gotcha's). On the other hand infix hasn't saved
> SGML from being difficult to parse.

I suppose you mean 'prefix' instead of 'infix'.

--
Arthur Lemmens