REGEXPS

From: Vladimir Zolotykh
Subject: REGEXPS
Date: Mon, 05 Dec 2005 10:56:12 +0000
Message-ID: <dn16c9$c0a$1@dcs.eurocom.od.ua>

Should they be a first or a last resort? Should we use them anywhere we
can, or should we strive to avoid them and use them only when there is
no other way?

Suppose I have to parse some date format like

   Mon 05 Dec 12:25:35 2005 EET

I could use regexps or write a function PARSE-DATE using only CHAR or
READ-CHAR functions. If we think of the regexp solution the merits are
obvious: 1) it's an almost instant solution 2) you don't need to test it
3) it's almost self-documenting, i.e. looking at it you easily see what
it does, its meaning and purpose 4) if you need to change the parse
format it's easily and quickly done. The only drawback (and I'm afraid
it's subjective) is that my conscience blames me a little, as if I were
doing something not as thoroughly as I might.

If we think of a self-made parse function, 1, 2, 3 and 4 are reversed
(for me). It's not so quick to write, needs to be debugged, is not so
obvious in meaning unless you commented it, and if you need to change
the input format it takes some time to make the needed modifications,
especially if you wrote the function long ago. The only merit is that
you know what you're doing, and doing that gives you pleasure.

Maybe I'm totally wrong, considering that regexps are quite popular
nowadays? I can hardly imagine any Perl program w/o extensive use of
them, Mutt is based on them, etc.

Of course I can (and sometimes do) use them in CL, but I still have
doubts...

-- 
Vladimir Zolotykh

Re: REGEXPS Christophe Rhodes
Re: REGEXPS Pascal Bourguignon
- Re: REGEXPS Pascal Bourguignon
  - Re: REGEXPS Edi Weitz
  - Re: REGEXPS Vladimir Zolotykh
    - Re: REGEXPS Pascal Bourguignon
- Re: REGEXPS Vladimir Zolotykh
  - Re: REGEXPS Edi Weitz
  - Re: REGEXPS Pascal Bourguignon
Re: REGEXPS N. Raghavendra

From: Christophe Rhodes
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:06:42 +0000
Message-ID: <sqslt7erb1.fsf@cam.ac.uk>

Vladimir Zolotykh <······@eurocom.od.ua> writes:

> Should they be a first or a last resort? Should we use them anywhere we
> can, or should we strive to avoid them and use them only when there is
> no other way?

My advice would be, if you're happy with whichever regexp package you
have, to use regexps to parse regular languages, and usually not in
other cases.

Christophe

From: Pascal Bourguignon
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:09:59 +0000
Message-ID: <87d5kbdcl4.fsf@thalassa.informatimago.com>

Vladimir Zolotykh <······@eurocom.od.ua> writes:

> Should they be a first or a last resort? Should we use them anywhere we
> can, or should we strive to avoid them and use them only when there is
> no other way?
>
> Suppose I have to parse some date format like
>
>    Mon 05 Dec 12:25:35 2005 EET
>
> I could use regexps or write a function PARSE-DATE using only CHAR or
> READ-CHAR functions. If we think of the regexp solution the merits are
> obvious: 1) it's an almost instant solution 2) you don't need to test it
> 3) it's almost self-documenting, i.e. looking at it you easily see what
> it does, its meaning and purpose 4) if you need to change the parse
> format it's easily and quickly done. The only drawback (and I'm afraid
> it's subjective) is that my conscience blames me a little, as if I were
> doing something not as thoroughly as I might.
>
> If we think of a self-made parse function, 1, 2, 3 and 4 are reversed
> (for me). It's not so quick to write, needs to be debugged, is not so
> obvious in meaning unless you commented it, and if you need to change
> the input format it takes some time to make the needed modifications,
> especially if you wrote the function long ago. The only merit is that
> you know what you're doing, and doing that gives you pleasure.
>
> Maybe I'm totally wrong, considering that regexps are quite popular
> nowadays? I can hardly imagine any Perl program w/o extensive use of
> them, Mutt is based on them, etc.
>
> Of course I can (and sometimes do) use them in CL, but I still have
> doubts...

Depends if your language is regular or not.

-- 
__Pascal_Bourguignon__               _  Software patents are endangering
()  ASCII ribbon against html email (o_ the computer industry all around
/\  1962:DO20I=1.100                //\ the world http://lpf.ai.mit.edu/
    2001:my($f)=`fortune`;          V_/   http://petition.eurolinux.org/

From: Pascal Bourguignon
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:30:41 +0000
Message-ID: <8764q3dbmm.fsf@thalassa.informatimago.com>

Pascal Bourguignon <····@mouse-potato.com> writes:

> Depends if your language is regular or not.

I should add a clarification.  Of course, if you want to parse nested
parentheses, that's not a regular language, so you need a parser.  But
some languages are so weak they don't need regular expressions either.

For example:

     Mon 05 Dec 12:25:35 2005 EET
     DOW NN MMM HH MM SS YYYY ZZZ

Why would you need to parse this?
You can access the fields without any parsing.


[1]> (defmacro defrecord (name &rest fields)
  `(progn
     ,@(mapcan (lambda (field)
                 `((defun ,(intern (format nil "~A-~A"
                                           name (first field)))
                       (record)
                     ,(case (fourth field)
                       (integer `(parse-integer
                                  (subseq record ,(second field)
                                          ,(+ (second field) (third field)))))
                       (otherwise `(subseq record ,(second field)
                                          ,(+ (second field) (third field))))))
                   (defun (setf ,(intern (format nil "~A-~A"
                                                 name (first field))))
                       (value record)
                     ,(case (fourth field)
                       (integer
                        `(replace record (format nil
                                           ,(format nil "~~~D,'0D" (third field))
                                           value)
                                  :start1 ,(second field)
                                  :end1 ,(+ (second field) (third field))))
                       (otherwise
                        `(replace record value
                                  :start1 ,(second field)
                                  :end1 ,(+ (second field) (third field))))))))
               fields)
     ',name))
DEFRECORD
[2]> (macroexpand-1 '(defrecord date
    ;; Mon 05 Dec 12:25:35 2005 EET
    ;; 0123456789012345678901234567890
    ;; 0000000000111111111122222222223
  (dow      0 3)
  (day      4 2 integer)
  (month    7 3)
  (hour    11 2 integer)
  (minute  14 2 integer)
  (second  17 2 integer)
  (year    20 4 integer)
  (zone    25 3))

)
(PROGN (DEFUN DATE-DOW (RECORD) (SUBSEQ RECORD 0 3))
 (DEFUN (SETF DATE-DOW) (VALUE RECORD) (REPLACE RECORD VALUE :START1 0 :END1 3))
 (DEFUN DATE-DAY (RECORD) (PARSE-INTEGER (SUBSEQ RECORD 4 6)))
 (DEFUN (SETF DATE-DAY) (VALUE RECORD)
  (REPLACE RECORD (FORMAT NIL "~2,'0D" VALUE) :START1 4 :END1 6))
 (DEFUN DATE-MONTH (RECORD) (SUBSEQ RECORD 7 10))
 (DEFUN (SETF DATE-MONTH) (VALUE RECORD)
  (REPLACE RECORD VALUE :START1 7 :END1 10))
 (DEFUN DATE-HOUR (RECORD) (PARSE-INTEGER (SUBSEQ RECORD 11 13)))
 (DEFUN (SETF DATE-HOUR) (VALUE RECORD)
  (REPLACE RECORD (FORMAT NIL "~2,'0D" VALUE) :START1 11 :END1 13))
 (DEFUN DATE-MINUTE (RECORD) (PARSE-INTEGER (SUBSEQ RECORD 14 16)))
 (DEFUN (SETF DATE-MINUTE) (VALUE RECORD)
  (REPLACE RECORD (FORMAT NIL "~2,'0D" VALUE) :START1 14 :END1 16))
 (DEFUN DATE-SECOND (RECORD) (PARSE-INTEGER (SUBSEQ RECORD 17 19)))
 (DEFUN (SETF DATE-SECOND) (VALUE RECORD)
  (REPLACE RECORD (FORMAT NIL "~2,'0D" VALUE) :START1 17 :END1 19))
 (DEFUN DATE-YEAR (RECORD) (PARSE-INTEGER (SUBSEQ RECORD 20 24)))
 (DEFUN (SETF DATE-YEAR) (VALUE RECORD)
  (REPLACE RECORD (FORMAT NIL "~4,'0D" VALUE) :START1 20 :END1 24))
 (DEFUN DATE-ZONE (RECORD) (SUBSEQ RECORD 25 28))
 (DEFUN (SETF DATE-ZONE) (VALUE RECORD)
  (REPLACE RECORD VALUE :START1 25 :END1 28))
 'DATE) ;
T
[3]> (defrecord date
    ;; Mon 05 Dec 12:25:35 2005 EET
    ;; 0123456789012345678901234567890
    ;; 0000000000111111111122222222223
  (dow      0 3)
  (day      4 2 integer)
  (month    7 3)
  (hour    11 2 integer)
  (minute  14 2 integer)
  (second  17 2 integer)
  (year    20 4 integer)
  (zone    25 3))


DATE
[4]> (defparameter d (copy-seq "Mon 05 Dec 12:25:35 2005 EET"))
D
[5]> (date-day d)
5 ;
2
[8]> (date-month d)
"Dec"
[9]> (date-minute d)
25 ;
2
[10]> (setf (date-minute d) 37)
"Mon 05 Dec 12:37:35 2005 EET"
[11]> d
"Mon 05 Dec 12:37:35 2005 EET"
[12]> (setf (date-month d) "Jan")
"Mon 05 Jan 12:37:35 2005 EET"
[13]> d
"Mon 05 Jan 12:37:35 2005 EET"
[14]> 


     

From: Edi Weitz
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:44:52 +0000
Message-ID: <u4q5n4vkb.fsf@agharta.de>

On Mon, 05 Dec 2005 12:30:41 +0100, Pascal Bourguignon <····@mouse-potato.com> wrote:

> I should add a clarification.  Of course, if you want to parse nested
> parentheses, that's not a regular language, so you need a parser.  But
> some languages are so weak they don't need regular expressions
> either.
>
> For example:
>
>      Mon 05 Dec 12:25:35 2005 EET
>      DOW NN MMM HH MM SS YYYY ZZZ
>
> Why would you need to parse this?

As the OP said: Because the resulting code (using regular expressions)
might be much shorter and easier to understand than your homegrown
code.

Why would you want to write programs in Lisp when everything can be
done in Assembler?

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: Vladimir Zolotykh
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:58:56 +0000
Message-ID: <dn1a1t$4n4$1@dcs.eurocom.od.ua>

Pascal Bourguignon wrote:
> For example:
> 
>      Mon 05 Dec 12:25:35 2005 EET
>      DOW NN MMM HH MM SS YYYY ZZZ
> 
> Why would you need to parse this?
Because usually I can't rely on the number of spaces being fixed or
predetermined, nor on the number of digits, and the TZ might not always
be EET.
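
A minimal sketch of that concern, using only standard Common Lisp: split
the line on runs of whitespace instead of relying on fixed columns. The
names WHITESPACE-P, SPLIT-ON-WHITESPACE and PARSE-DATE-TOKENS are
hypothetical, and the field order is assumed from the single example
line quoted above.

```lisp
(defun whitespace-p (char)
  "True if CHAR is a space, tab or newline."
  (member char '(#\Space #\Tab #\Newline)))

(defun split-on-whitespace (string)
  "Return the maximal runs of non-whitespace characters in STRING."
  (let ((tokens '())
        (start (position-if-not #'whitespace-p string)))
    (loop while start
          do (let ((stop (or (position-if #'whitespace-p string :start start)
                             (length string))))
               (push (subseq string start stop) tokens)
               (setf start (position-if-not #'whitespace-p string
                                            :start stop))))
    (nreverse tokens)))

(defun parse-date-tokens (string)
  "Return the date fields of STRING as a plist.
Signals an error unless STRING has exactly six tokens and the time
token has the form H:M:S."
  (destructuring-bind (dow day month time year zone)
      (split-on-whitespace string)
    (let* ((c1 (position #\: time))
           (c2 (position #\: time :start (1+ c1))))
      (list :dow dow
            :day (parse-integer day)
            :month month
            :hour (parse-integer time :end c1)
            :minute (parse-integer time :start (1+ c1) :end c2)
            :second (parse-integer time :start (1+ c2))
            :year (parse-integer year)
            :zone zone))))
```

With this, (parse-date-tokens "Mon  5 Dec 12:25:35 2005 EEST") accepts
the extra space, the one-digit day and the four-letter zone.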

-- 
Vladimir Zolotykh

From: Pascal Bourguignon
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 12:23:07 +0000
Message-ID: <87slt7bums.fsf@thalassa.informatimago.com>

Vladimir Zolotykh <······@eurocom.od.ua> writes:

> Pascal Bourguignon wrote:
>> For example:
>>      Mon 05 Dec 12:25:35 2005 EET
>>      DOW NN MMM HH MM SS YYYY ZZZ
>> Why would you need to parse this?
> Because usually I can't rely on the number of spaces being fixed or
> predetermined, nor on the number of digits, and the TZ might not always
> be EET.

So you see the importance of specifying your language by more formal
means than a few examples.  You can formalize it with production rules
or, for a regular language, directly with regular expressions.  In the
latter case, obviously it's better to just copy the regular expression
into your parser.

In conclusion: if your language is a regular language then use regular
expressions.  If your language is not, then don't.  (And if you've got
an efficiency bottleneck and your language is weaker than regular,
don't parse it at all!)
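
As an illustration of the first case, here is a minimal sketch that
treats the date line as a regular language with flexible whitespace. It
assumes the CL-PPCRE library is installed and loadable through
Quicklisp; neither is named in this thread, and PARSE-DATE-REGEX and the
exact pattern are hypothetical choices.

```lisp
;; Assumption: Quicklisp is set up and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload "cl-ppcre" :silent t))

(defun parse-date-regex (string)
  "Return the date fields of STRING as a plist, or NIL if it does not match.
Runs of whitespace may have any length; day and hour may have 1 or 2 digits."
  (cl-ppcre:register-groups-bind
      (dow (#'parse-integer day) month
       (#'parse-integer hour minute sec year) zone)
      ("^\\s*([A-Za-z]{3})\\s+(\\d{1,2})\\s+([A-Za-z]{3})\\s+(\\d{1,2}):(\\d{2}):(\\d{2})\\s+(\\d{4})\\s+([A-Za-z]+)\\s*$"
       string)
    (list :dow dow :day day :month month
          :hour hour :minute minute :second sec
          :year year :zone zone)))
```

The pattern is the specification: changing the accepted format means
editing one string, which is the maintenance argument made earlier in
the thread.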

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
__Pascal Bourguignon__                     http://www.informatimago.com/

From: Vladimir Zolotykh
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:16:30 +0000
Message-ID: <dn17ic$ksc$1@dcs.eurocom.od.ua>

Pascal Bourguignon wrote:

> Depends if your language is regular or not.
I believe you meant "programming language", and it is Common Lisp.

-- 
Vladimir Zolotykh

From: Edi Weitz
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:37:20 +0000
Message-ID: <u8xuz4vwv.fsf@agharta.de>

On Mon, 05 Dec 2005 13:16:30 +0200, Vladimir Zolotykh <······@eurocom.od.ua> wrote:

>> Depends if your language is regular or not.
> I believe you meant "programming language", and it is Common Lisp.

No, he meant the (formal) language you want to parse.  That's a term
from CS:

  <http://en.wikipedia.org/wiki/Formal_language>

Note that there are a lot of languages that aren't regular.  It's a
common mistake to try to parse one of these languages using regular
expressions just because they are available and seem easy to use.

Cheers,
Edi.


From: Pascal Bourguignon
Subject: Re: REGEXPS
Date: Mon, 05 Dec 2005 11:38:46 +0000
Message-ID: <871x0rdb95.fsf@thalassa.informatimago.com>

Vladimir Zolotykh <······@eurocom.od.ua> writes:

> Pascal Bourguignon wrote:
>
>> Depends if your language is regular or not.
> I believe you meant "programming language", and it is Common Lisp.

No, I mean the language of your data.

To parse words like this:

abc
aaabccc
aaaaabc

you need to define their language; for example:

start -> as 'b' cs ;
as -> 'a' ;
as -> 'a' as ;
cs -> 'c' ;
cs -> 'c' cs ;

which is a regular language since it can be defined with a regular
expression: aa*bcc*
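
A minimal sketch of what "regular" buys you: the language aa*bcc* can be
recognized by a finite-state machine with no stack and no backtracking.
The function name MATCH-A+BC+ and the state names are hypothetical; this
is standard Common Lisp only.

```lisp
(defun match-a+bc+ (word)
  "True if WORD belongs to the language aa*bcc*."
  (let ((state :start))
    (loop for ch across word
          do (setf state
                   (case state
                     ;; Nothing read yet: only an 'a' may start a word.
                     (:start (if (char= ch #\a) :as :fail))
                     ;; Reading a's: more a's, or the single 'b'.
                     (:as (case ch (#\a :as) (#\b :b) (t :fail)))
                     ;; Just read 'b': at least one 'c' must follow.
                     (:b (if (char= ch #\c) :cs :fail))
                     ;; Reading c's: only more c's are allowed.
                     (:cs (if (char= ch #\c) :cs :fail))
                     ;; :fail is a trap state.
                     (t :fail))))
    ;; Accept only if the word ended while reading c's.
    (eq state :cs)))
```

It accepts "abc", "aaabccc" and "aaaaabc", and rejects words such as
"ab" or "abcb".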


To parse words like this:

x
()
(x)
(x x)
((x) () (x))

you need to define their language; for example:

start -> e ;
e -> s ;
e -> '(' ')' ;
e -> '(' es ')' ;
es -> e ;
es -> e es ;
s -> 'x' ;

And for this language, you cannot find a regular expression: it's not
a regular language.
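
For such a language a small recursive-descent parser does the job. The
following is a sketch in standard Common Lisp, assuming that elements
are separated by spaces as in the sample words; SKIP-SPACES, PARSE-E and
PARSE-WORD are hypothetical names. The recursion in PARSE-E is what
keeps track of nesting depth, which a finite automaton cannot do.

```lisp
(defun skip-spaces (string start)
  "Return the index of the first non-space character at or after START."
  (or (position-if-not (lambda (c) (char= c #\Space)) string :start start)
      (length string)))

(defun parse-e (string &optional (start 0))
  "Parse one E from STRING beginning at START.
Return the parse tree and the index just past it; signal an error on
malformed input."
  (let ((pos (skip-spaces string start)))
    (cond ((>= pos (length string))
           (error "Unexpected end of input"))
          ((char= (char string pos) #\x)
           (values 'x (1+ pos)))
          ((char= (char string pos) #\()
           (let ((items '()))
             (setf pos (skip-spaces string (1+ pos)))
             (loop
               (when (>= pos (length string))
                 (error "Missing closing parenthesis"))
               (when (char= (char string pos) #\))
                 (return (values (nreverse items) (1+ pos))))
               ;; Recursive call: each nested E is parsed by PARSE-E itself.
               (multiple-value-bind (item next) (parse-e string pos)
                 (push item items)
                 (setf pos (skip-spaces string next))))))
          (t (error "Unexpected character ~S at position ~D"
                    (char string pos) pos)))))

(defun parse-word (string)
  "Parse STRING as exactly one E and return its tree."
  (multiple-value-bind (tree next) (parse-e string)
    (unless (= (skip-spaces string next) (length string))
      (error "Trailing input at position ~D" next))
    tree))
```

For example, (parse-word "((x) () (x))") returns the nested list
((X) NIL (X)), while an unbalanced word such as "(x" signals an error.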



And the point of my other answer was that for languages such as:

abc
abe
dbe

start -> o 'b' t ;
o -> 'a' ;
o -> 'd' ;
t -> 'c' ;
t -> 'e' ;

While you can use regular expressions, it's overkill: you can directly
access the values without any parsing: o = (subseq word 0 1)
                                       t = (subseq word 2 3)


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
The rule for today:
Touch my tail, I shred your hand.
New rule tomorrow.

From: N. Raghavendra
Subject: Re: REGEXPS
Date: Mon, 12 Dec 2005 09:55:40 +0000
Message-ID: <86wtiahc6b.fsf@riemann.mri.ernet.in>

At 2005-12-05T12:56:12+02:00, Vladimir Zolotykh wrote:

> Suppose I have to parse some date format like
> 
>    Mon 05 Dec 12:25:35 2005 EET
> 
> I could use regexps or write a function PARSE-DATE using only CHAR or
> READ-CHAR functions.

On a related note, the `timezone.el' Emacs Lisp library does this
using regular expressions.

Raghavendra.

-- 
N. Raghavendra <·····@mri.ernet.in> | See message headers for contact
Harish-Chandra Research Institute   | and OpenPGP details.