implementation for Parsing Expression Grammar?

From: ······@gmail.com
Subject: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 05:52:30 +0000
Message-ID: <a450ef08-642f-484d-97bc-614121b502f8@c19g2000prf.googlegroups.com>

In the past weeks i've been thinking over the problem on the practical
problems of regex in its matching power. For example, often it can't
be used to match anything of nested nature, even the most simple
nesting. It can't be used to match any simple grammar expressed by
BNF. Some rather very regular and simple languages such as XML, or
even url, email address, are not specified as a regex. (there exist
regex that are pages long that tried to match email address though)

I wrote out a more elaborate account of my thoughts here:
 http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

----------------

After days of researching this problem, looking into parsers and its
theories etc, today i found the answer!!

What i was looking for is called Parsing Expression Grammar (PEG).

See
http://en.wikipedia.org/wiki/Parsing_expression_grammar

It seems to me it's already in Perl6, and there's also a
implementation in Haskell. Is the perl6 PEG is in a usable state?

Thanks.

  Xah
  ···@xahlee.org
∑ http://xahlee.org/

☄

Re: implementation for Parsing Expression Grammar? Kay Schluehr
- Re: implementation for Parsing Expression Grammar? C S S
  - Re: implementation for Parsing Expression Grammar? D Herring
Re: implementation for Parsing Expression Grammar? =?UTF-8?B?TGFycyBSdW5lIE7DuHN0ZGFs?=
- Re: implementation for Parsing Expression Grammar? Barb Knox
Re: implementation for Parsing Expression Grammar? George Neuner
- Re: implementation for Parsing Expression Grammar? ······@gmail.com
  - Re: implementation for Parsing Expression Grammar? George Neuner
Re: implementation for Parsing Expression Grammar? Robert Maas, http://tinyurl.com/uh3t
- Re: implementation for Parsing Expression Grammar? George Neuner
Re: implementation for Parsing Expression Grammar? Stanisław Halik

From: Kay Schluehr
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 13:41:36 +0000
Message-ID: <ef9aee02-661f-4556-b0b7-76434f57ca55@m36g2000hse.googlegroups.com>

On 10 Mai, 07:52, ·······@gmail.com" <······@gmail.com> wrote:
> In the past weeks i've been thinking over the problem on the practical
> problems of regex in its matching power. For example, often it can't
> be used to match anything of nested nature, even the most simple
> nesting. It can't be used to match any simple grammar expressed by
> BNF. Some rather very regular and simple languages such as XML, or
> even url, email address, are not specified as a regex.

Well formed XML cannot be fully specified within BNF as well because
it is context sensitive: in order to recognize a tag/endtag pair one
has to maintain a stack. That's not a big deal in practice if one
wants to write an XML parser but one can't use an arbitrary LL or LR
parser generator to produce a parse tree representing the XML.

From: C S S
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 21:27:23 +0000
Message-ID: <g053vr$dre$1@news.lrz-muenchen.de>

Kay Schluehr wrote:
> On 10 Mai, 07:52, ·······@gmail.com" <······@gmail.com> wrote:
> 
>>In the past weeks i've been thinking over the problem on the practical
>>problems of regex in its matching power. For example, often it can't
>>be used to match anything of nested nature, even the most simple
>>nesting. It can't be used to match any simple grammar expressed by
>>BNF. Some rather very regular and simple languages such as XML, or
>>even url, email address, are not specified as a regex.
> 
> 
> Well formed XML cannot be fully specified within BNF as well because
> it is context sensitive: in order to recognize a tag/endtag pair one
> has to maintain a stack. That's not a big deal in practice if one
> wants to write an XML parser but one can't use an arbitrary LL or LR
> parser generator to produce a parse tree representing the XML.
> 
Sorry, but I think that you are not completely right. XML is - I think - 
at least a Deterministic Contest-free syntax 
(http://en.wikipedia.org/wiki/Deterministic_context-free_language) - at 
least if you take the tag-names and attribute-names as terminal-symbols, 
rather than the unicode-characters directly. It can be parsed using a 
stack - as you said - i.e. using a deterministic pushdown automaton. I 
think it is even LL(1) (http://en.wikipedia.org/wiki/LL%281%29).

If you take each character as a terminal-symbol, then I think you are 
right. Then it is not context-free. But this is only because you have 
named closing-tags, if you didnt have closing-tags named, then the set 
of well-formed xml-data would be context-free again. It would be some 
(ugly, complicated) kind of noting s-espressions.

From: D Herring
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 23:58:03 +0000
Message-ID: <rJGdncLGdKt9q7vVnZ2dnUVZ_rDinZ2d@comcast.com>

C S S wrote:
> If you take each character as a terminal-symbol, then I think you are 
> right. Then it is not context-free. But this is only because you have 
> named closing-tags, if you didnt have closing-tags named, then the set 
> of well-formed xml-data would be context-free again. It would be some 
> (ugly, complicated) kind of noting s-espressions.

but requiring the tags to be listed as terminal symbols kinda kills 
the whole "extensible" thing...

From: =?UTF-8?B?TGFycyBSdW5lIE7DuHN0ZGFs?=
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 10:36:38 +0000
Message-ID: <48257ab7$0$29576$c83e3ef6@nn1-read.tele2.net>

Hi,
Finite automata works for "nested things".

http://en.wikipedia.org/wiki/Automata_theory

-- 
Lars Rune Nøstdal
http://nostdal.org/

From: Barb Knox
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sat, 10 May 2008 23:25:43 +0000
Message-ID: <see-68826E.11254311052008@lust.ihug.co.nz>

In article <·························@nn1-read.tele2.net>,
 Lars Rune Nøstdal <···········@gmail.com> wrote:

> Hi,
> Finite automata works for "nested things".

Only in the special case when the depth of nesting is bounded ahead of 
time.  If it's unbounded then there is an unbounded amount of "stack" 
information that the automaton needs to remember, therefore a finite 
automaton cannot do it.

<pedantry>
That should be "Finite automata WORK ...", since "automata" is plural.
</pedantry>

> http://en.wikipedia.org/wiki/Automata_theory

-- 
---------------------------
|  BBB                b    \     Barbara at LivingHistory stop co stop uk
|  B  B   aa     rrr  b     |
|  BBB   a  a   r     bbb   |    Quidquid latine dictum sit,
|  B  B  a  a   r     b  b  |    altum viditur.
|  BBB    aa a  r     bbb   |   
-----------------------------

From: George Neuner
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sun, 11 May 2008 01:26:54 +0000
Message-ID: <vdhc245g2fh8que71re3beuok2i5e8snm9@4ax.com>

On Fri, 9 May 2008 22:52:30 -0700 (PDT), ·······@gmail.com"
<······@gmail.com> wrote:

>In the past weeks i've been thinking over the problem on the practical
>problems of regex in its matching power. For example, often it can't
>be used to match anything of nested nature, even the most simple
>nesting. It can't be used to match any simple grammar expressed by
>BNF. Some rather very regular and simple languages such as XML, or
>even url, email address, are not specified as a regex. (there exist
>regex that are pages long that tried to match email address though)

What's your point?  The limitations of regular expressions are well
known.

>After days of researching this problem, looking into parsers and its
>theories etc, today i found the answer!!
>
>What i was looking for is called Parsing Expression Grammar (PEG).

PEG has its own problems - it's very easy with PEG to create subtly
ambiguous grammars for which quite legal looking input is rejected.
And there are no good tools to analyze a PEG and warn you of subtle
problems.

Chris Clark (YACC++) has posted at length about the merits, problems
and limitations of various parse techniques - including PEG - in
comp.compilers.  Before you consider doing anything with PEG I suggest
you look up his posts and read the related threads.

>It seems to me it's already in Perl6, and there's also a
>implementation in Haskell. Is the perl6 PEG is in a usable state?
>
>Thanks.
>
>  Xah
>  ···@xahlee.org
>? http://xahlee.org/

George
--
for email reply remove "/" from address

From: ······@gmail.com
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sun, 11 May 2008 16:36:08 +0000
Message-ID: <c4e468fa-9d97-423f-aebc-f143302cc0eb@z24g2000prf.googlegroups.com>

 George Neuner wrote:
«What's your point?  The limitations of regular expressions are well
known.»

Hi George,

Please see:
 http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

At the bottom, i gave a account of the origin of my realization.
You'll see how it ties to formal systems (e.g. automatic theorem
proving) as well as a lisp source code reformatter.

  Xah
  ···@xahlee.org
∑ http://xahlee.org/

☄

From: George Neuner
Subject: Re: implementation for Parsing Expression Grammar?
Date: Mon, 12 May 2008 05:59:19 +0000
Message-ID: <thhf24dpqnlhefmhtgam7719rjtps0va24@4ax.com>

On Sun, 11 May 2008 09:36:08 -0700 (PDT), ·······@gmail.com"
<······@gmail.com> wrote:

> George Neuner wrote:
>�What's your point?  The limitations of regular expressions are well
>known.�
>
>Hi George,
>
>Please see:
> http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html
>
>At the bottom, i gave a account of the origin of my realization.
>You'll see how it ties to formal systems (e.g. automatic theorem
>proving) as well as a lisp source code reformatter.
>
>  Xah
>  ···@xahlee.org
>? http://xahlee.org/

Ok.  But I think PEG is overkill.

PEG is predicated LL(k) where k is potentially infinite and where
backtracking occurs on failure.  In power it is roughly equivalent to
GLR in that it handles unambiguous grammars that have long ambiguous
prefixes.  But it is not well understood and there are very few PEG
tools.  In particular existing tools can't always tell you whether
your grammar really is ambiguous rather than just containing ambiguous
prefixes.  PEG grammars have to be unambiguous in totality but any
particular substring can approach human language complexity.

Neither LL(k) for fixed k or LR has that problem.  LR(1) is sufficient
to describe almost all existing programming languages and most
existing data description languages are LL.  Any of the LR or LL
parser tools are likely all you'll need to design a useful language.

If you really want to play with PEG-like grammars, ANTLR v3 offers
LL(*) which is similar.  I suggest you read Terence Parr's thesis
which explains his approach to predicated LL(k) and his Powerpoint
slides on the LL(*) method:

http://www.antlr.org/papers/parr.phd.thesis.pdf
http://www.antlr.org/papers/Coverity-LL-star.ppt

Parr claims LL(*) is power equivalent to PEG and GLR, but more
practical to use.  I haven't seen independent review of the method and
I haven't tried it (I've designed several programming and data
languages but never needed anything approaching PEG complexity).

George
--
for email reply remove "/" from address

From: Robert Maas, http://tinyurl.com/uh3t
Subject: Re: implementation for Parsing Expression Grammar?
Date: Sun, 11 May 2008 23:00:39 +0000
Message-ID: <rem-2008may11-004@yahoo.com>

(You posted to too many newsgroups, trimmed.)

18TH MODEM DROPPAGE TODAY AT 15:15, SINCE FIRST LOGIN TODAY AT 10:49.
(During one of my earlier messages I lost count by one, sorry.)
MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.

> From: ·······@gmail.com" <······@gmail.com>
> In the past weeks i've been thinking over the problem on the
> practical problems of regex in its matching power. For example,
> often it can't be used to match anything of nested nature, even the
> most simple nesting. It can't be used to match any simple grammar
> expressed by BNF. Some rather very regular and simple languages
> such as XML, or even url, email address, are not specified as a
> regex. (there exist regex that are pages long that tried to match
> email address though)

A couple years ago I wrote a table-driven parser in Lisp that could
deal with nested stuff to some degree. The result of the parse was
a structure that matched the parse tree, so that further processing
could be done on that result. My test case was Received lines in spam.
The syntax was essentially the properly-structured version of
something like a regex crossed with BNF.

> I wrote out a more elaborate account of my thoughts here:
> http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html

Looking there now ...

   For example, in computing, one would like to say that email address has
   the form xyz, where xyz is a perl regex. (e.g. "[A-z0-9]·@[A-z]+\.com")

Per my system, that would be something more like
  (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
  (:EXACTLY ·@")
  (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
  (:EXACTLY ".com")
I forget how I specified characters, such as in ranges, so I left
that unspecified there. But I'm pretty sure I used strings when
specifying exact text, so I put quote marks in there. No, I don't
feel like looking up the details right now. My general idea above
should be enough for your immediate interest.

19TH MODEM DROPPAGE TODAY AT 15:34, SINCE FIRST LOGIN TODAY AT 10:49.
(During one of my earlier messages I lost count by one, sorry.)
MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.

   It is quite desirable to have a general grammar language, designed in
   a human-readible way, and concise. With such a language, we could use
   it to verify if a text is a valid form. We could use it for human
   communication.

I think my nested notation is sufficiently clear. Also, typically I
break my pattern into multiple production rules, with a meaningful
name for each. Thus the above might be instead:

  Address = (User Atsign Host)
  User = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
  Atsign = (:EXACTLY ·@")
  Host = (Hostname Dotcom)
  Hostname = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
  Dotcom = (:EXACTLY ".com")

20TH MODEM DROPPAGE TODAY AT 15:39, SINCE FIRST LOGIN TODAY AT 10:49.

> After days of researching this problem, looking into parsers and its
> theories etc, today i found the answer!!
> What i was looking for is called Parsing Expression Grammar (PEG).
> See
> http://en.wikipedia.org/wiki/Parsing_expression_grammar

Looking at that now ...

   Unlike CFGs, PEGs are not ambiguous; if a string parses, it has
   exactly one valid parse tree.

When using BNF or other formal specification of a grammar (syntax),
you have to be careful not to present overlapping rules, or you
need a precedence that one rule applies before another if both
match. Is the claim that with PEGs there is no such problem in the
first place??

          + Ordered choice: e[1] / e[2]

What if both e[1] and e[2] match exactly the same sub-string?
That would seem to me to create an ambiguity as to how to parse
that sub-string. How does a PEG resolve this ambiguity??

   The choice operator e[1] / e[2] first invokes e[1], and if e[1]
   succeeds, returns its result immediately. Otherwise, if e[1] fails,
   then the choice operator backtracks to the original input position at
   which it invoked e[1], but then calls e[2] instead, returning e[2]'s
   result.

OK, there's the answer, linear precedence. Basically they've
re-invented BNF with the caveat that the production rules are in
linear sequence and earlier rules override later rules, using the
choice operator to syntactically condense multiple sequential
production rules into a single mega-rule with multiple sequential
clauses. (Actually BNF already had that condensation IIRC, it just
didn't have the linear precedence on clauses.)

I'd need to look up whether my own structured parse-grammar used
precedence too. Maybe some other day if anybody asks.

21ST MODEM DROPPAGE TODAY AT 16:01, SINCE FIRST LOGIN TODAY AT 10:49.

22ND MODEM DROPPAGE TODAY AT 16:03.

From: George Neuner
Subject: Re: implementation for Parsing Expression Grammar?
Date: Mon, 12 May 2008 07:00:48 +0000
Message-ID: <neqf24tgeikjpb7uppuofq7snbg0m6i146@4ax.com>

On Sun, 11 May 2008 16:00:39 -0700,
···················@SpamGourmet.Com (Robert Maas,
http://tinyurl.com/uh3t) wrote:

>(You posted to too many newsgroups, trimmed.)
>
>18TH MODEM DROPPAGE TODAY AT 15:15, SINCE FIRST LOGIN TODAY AT 10:49.
>(During one of my earlier messages I lost count by one, sorry.)
>MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.
>
>> From: ·······@gmail.com" <······@gmail.com>
>> In the past weeks i've been thinking over the problem on the
>> practical problems of regex in its matching power. For example,
>> often it can't be used to match anything of nested nature, even the
>> most simple nesting. It can't be used to match any simple grammar
>> expressed by BNF. Some rather very regular and simple languages
>> such as XML, or even url, email address, are not specified as a
>> regex. (there exist regex that are pages long that tried to match
>> email address though)
>
>A couple years ago I wrote a table-driven parser in Lisp that could
>deal with nested stuff to some degree. The result of the parse was
>a structure that matched the parse tree, so that further processing
>could be done on that result. My test case was Received lines in spam.
>The syntax was essentially the properly-structured version of
>something like a regex crossed with BNF.
>
>> I wrote out a more elaborate account of my thoughts here:
>> http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html
>
>Looking there now ...
>
>   For example, in computing, one would like to say that email address has
>   the form xyz, where xyz is a perl regex. (e.g. "[A-z0-9]·@[A-z]+\.com")
>
>Per my system, that would be something more like
>  (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
>  (:EXACTLY ·@")
>  (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
>  (:EXACTLY ".com")
>I forget how I specified characters, such as in ranges, so I left
>that unspecified there. But I'm pretty sure I used strings when
>specifying exact text, so I put quote marks in there. No, I don't
>feel like looking up the details right now. My general idea above
>should be enough for your immediate interest.
>
>19TH MODEM DROPPAGE TODAY AT 15:34, SINCE FIRST LOGIN TODAY AT 10:49.
>(During one of my earlier messages I lost count by one, sorry.)
>MEAN TIME BETWEEN MODEM FAILURES <15 MINUTES.
>
>   It is quite desirable to have a general grammar language, designed in
>   a human-readible way, and concise. With such a language, we could use
>   it to verify if a text is a valid form. We could use it for human
>   communication.
>
>I think my nested notation is sufficiently clear. Also, typically I
>break my pattern into multiple production rules, with a meaningful
>name for each. Thus the above might be instead:
>
>  Address = (User Atsign Host)
>  User = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z) (:RANGE 0 9)))
>  Atsign = (:EXACTLY ·@")
>  Host = (Hostname Dotcom)
>  Hostname = (:ANYNUMBEROF1 (:CHOICE (:RANGE A Z) (:RANGE a z)))
>  Dotcom = (:EXACTLY ".com")
>
>20TH MODEM DROPPAGE TODAY AT 15:39, SINCE FIRST LOGIN TODAY AT 10:49.
>
>> After days of researching this problem, looking into parsers and its
>> theories etc, today i found the answer!!
>> What i was looking for is called Parsing Expression Grammar (PEG).
>> See
>> http://en.wikipedia.org/wiki/Parsing_expression_grammar
>
>Looking at that now ...
>
>   Unlike CFGs, PEGs are not ambiguous; if a string parses, it has
>   exactly one valid parse tree.

PEGs ca have arbitrary length ambiguous prefixes so they can have
complexity approaching CFG.  The reason for staying within LR(1) (or
the overlapping LL(k)) whenever possible is to keep ambiguity to
reasonable levels - even people start to have problems with more
complex languages.


>When using BNF or other formal specification of a grammar (syntax),
>you have to be careful not to present overlapping rules, or you
>need a precedence that one rule applies before another if both
>match. Is the claim that with PEGs there is no such problem in the
>first place??
>
>          + Ordered choice: e[1] / e[2]
>
>What if both e[1] and e[2] match exactly the same sub-string?
>That would seem to me to create an ambiguity as to how to parse
>that sub-string. How does a PEG resolve this ambiguity??
>
>   The choice operator e[1] / e[2] first invokes e[1], and if e[1]
>   succeeds, returns its result immediately. Otherwise, if e[1] fails,
>   then the choice operator backtracks to the original input position at
>   which it invoked e[1], but then calls e[2] instead, returning e[2]'s
>   result.
>
>OK, there's the answer, linear precedence. Basically they've
>re-invented BNF with the caveat that the production rules are in
>linear sequence and earlier rules override later rules, using the
>choice operator to syntactically condense multiple sequential
>production rules into a single mega-rule with multiple sequential
>clauses. (Actually BNF already had that condensation IIRC, it just
>didn't have the linear precedence on clauses.)
>
>I'd need to look up whether my own structured parse-grammar used
>precedence too. Maybe some other day if anybody asks.
>
>21ST MODEM DROPPAGE TODAY AT 16:01, SINCE FIRST LOGIN TODAY AT 10:49.
>
>22ND MODEM DROPPAGE TODAY AT 16:03.

--
for email reply remove "/" from address

From: Stanisław Halik
Subject: Re: implementation for Parsing Expression Grammar?
Date: Mon, 12 May 2008 11:29:04 +0000
Message-ID: <g099lt$a17$1@news2.task.gda.pl>

In comp.lang.lisp ······@gmail.com <······@gmail.com> wrote:

> In the past weeks i've been thinking over the problem on the practical
> problems of regex in its matching power. Some rather very regular and
> simple languages such as XML, or even url, email address, are not
> specified as a regex. (there exist regex that are pages long that
> tried to match email address though)

Wouldn't be so sure about that. Been using the following Perl regexp to
parse FQDNs:

^(?i)((?=[^-])[a-z0-9-]*[a-z0-9]\.)+[a-z]{2,6}$

It ensures that there are empty labels in the host names as well as that
the hyphen is only used in between characters in a label.

The username part in an email is trivial:

^(?i)[a-z0-9_+'.-]+$

Probably missed some allowed characters, though.

-- 
The great peril of our existence lies in the fact that our diet consists
entirely of souls. -- Inuit saying