From: William Bland
Subject: OT:  Recursive descent parsers
Date: 
Message-ID: <pan.2004.06.22.03.50.45.696007@abstractnonsense.com>
Yet another rather off topic question, posted here because Lisp hackers
know everything ;-)

I missed out on a formal computing education, as I was busy doing pure
mathematics, so I never learned about parsers.  This has become important
recently because I need to write one at work (unfortunately in Java rather
than Lisp, where it'd be much easier). I already know how to write it, and
I *think* it's what you'd call a "recursive descent parser", but every
time I try to read an article on parsers I get bogged down in the BNF,
which I don't find easy to read at all.

Basically the structure I'm thinking of for my parser is a top-level
function that dispatches on the first character it reads, and calls a
bunch of helper functions, something like:

(defun read-stuff ()
  (let ((c (peek-char)))
    (cond ((char= c #\Space)
	   (read-char)
	   (read-stuff))
	  ((char= c #\()
	   (read-char)
	   (read-list))
	  ((digit-char-p c)
	   (read-number))
	  (t
	   "ERROR!"))))

(defun read-list ()
  (let ((c (peek-char)))
    (cond ((char= c #\))
	   (read-char)
	   nil)
	  (t
	   (cons (read-stuff) (read-list))))))

(defun read-number ()
  (read-number-helper 0))

(defun read-number-helper (acc)
  (let ((digit (- (char-code (read-char))
		  (char-code #\0))))
    (setf acc (+ (* 10 acc) digit))
    (if (digit-char-p (peek-char))
	(read-number-helper acc)
        acc)))

but, of course, I'll have to write it in Java (yuck!).  I write parsers
this way just because I can't imagine any other way to write them (well, I
suppose I can, but they wouldn't be pretty).  I'd be interested to know
though:

1) is this a "normal" way to write a parser?  I've seen similar things in
Lisp books, so I know I'm not the *only* person who writes them this way,
but do other people write them like this (e.g. when people write parsers
for languages like Java)?

2) is this what people mean when they say "recursive descent parser"?

Cheers,
	Bill.
-- 
Dr. William Bland.
It would not be too unfair to any language to refer to Java as a
stripped down Lisp or Smalltalk with a C syntax.   (Ken Anderson).

From: Paul
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <40D7F0AC.1000006@uidaho.edu>
Look into the structure of lex and yacc(flex/bison).
The man pages are not too bad.

If you are familiar with automata theory and computability, a technical 
book on parsing should bring you up to speed easily.

There are three general questions:
-> lexical analysis (is this string in the set of permissable words, and 
is it [keyword|identifier|etc]?)

->syntactical anlysis (is this string of words in the grammar of this 
language or can it be transformed into the grammar of this language? 
(here is where bnf shows up)

->semantic analysis (assigning meaning to each construct in the grammar)
This is "annotated grammar"

Look into the compiler book by Aho, if you can find it.


William Bland wrote:
> Yet another rather off topic question, posted here because Lisp hackers
> know everything ;-)
> 
> I missed out on a formal computing education, as I was busy doing pure
> mathematics, so I never learned about parsers.  This has become important
> recently because I need to write one at work (unfortunately in Java rather
> than Lisp, where it'd be much easier). I already know how to write it, and
> I *think* it's what you'd call a "recursive descent parser", but every
> time I try to read an article on parsers I get bogged down in the BNF,
> which I don't find easy to read at all.
> 
> Basically the structure I'm thinking of for my parser is a top-level
> function that dispatches on the first character it reads, and calls a
> bunch of helper functions, something like:
> 
> (defun read-stuff ()
>   (let ((c (peek-char)))
>     (cond ((char= c #\Space)
> 	   (read-char)
> 	   (read-stuff))
> 	  ((char= c #\()
> 	   (read-char)
> 	   (read-list))
> 	  ((digit-char-p c)
> 	   (read-number))
> 	  (t
> 	   "ERROR!"))))
> 
> (defun read-list ()
>   (let ((c (peek-char)))
>     (cond ((char= c #\))
> 	   (read-char)
> 	   nil)
> 	  (t
> 	   (cons (read-stuff) (read-list))))))
> 
> (defun read-number ()
>   (read-number-helper 0))
> 
> (defun read-number-helper (acc)
>   (let ((digit (- (char-code (read-char))
> 		  (char-code #\0))))
>     (setf acc (+ (* 10 acc) digit))
>     (if (digit-char-p (peek-char))
> 	(read-number-helper acc)
>         acc)))
> 
> but, of course, I'll have to write it in Java (yuck!).  I write parsers
> this way just because I can't imagine any other way to write them (well, I
> suppose I can, but they wouldn't be pretty).  I'd be interested to know
> though:
> 
> 1) is this a "normal" way to write a parser?  I've seen similar things in
> Lisp books, so I know I'm not the *only* person who writes them this way,
> but do other people write them like this (e.g. when people write parsers
> for languages like Java)?
> 
> 2) is this what people mean when they say "recursive descent parser"?
> 
> Cheers,
> 	Bill.
From: William D Clinger
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <fb74251e.0406221142.263bca44@posting.google.com>
William Bland asked:
> 2) is this what people mean when they say "recursive descent parser"?

A recursive descent parser is an LL(1) parser that's implemented
along the lines you gave.  A parser that's implemented along those
lines but isn't LL(1) is not a recursive descent parser.  It might
be recursive descent with backup, or even worse, but it isn't a
recursive descent parser.

This distinction is important because recursive descent is one of
the very best parsing algorithms, while recursive descent with backup
is one of the worst parsing algorithms.  People who don't understand
any parsing theory at all often get them confused, with potentially
dire consequences.

Consult any standard textbook on compiler construction.  Look in
the index for "recursive descent", "top-down", LL(1), "director set".

Will
From: Rob Warnock
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <YP-dnfxTh7seVEXdRVn-jA@speakeasy.net>
William D Clinger <··········@verizon.net> wrote:
+---------------
| William Bland asked:
| > 2) is this what people mean when they say "recursive descent parser"?
| 
| A recursive descent parser is an LL(1) parser that's implemented
| along the lines you gave.  A parser that's implemented along those
| lines but isn't LL(1) is not a recursive descent parser.  It might
| be recursive descent with backup, or even worse, but it isn't a
| recursive descent parser.
+---------------

To expand a little (though rather informally and not very precisely)
on what Will said, an LL(1) grammar is one in which the next production
(non-terminal) to be reduced can be determined just from looking at the
next terminal symbol encountered in the input stream. This means that
it works very much better if you have a separate lexical analyzer in
front of it so that the LL(1) parser is looking at "tokens", not raw
characters.

That said, recursive descent parsers work *very* well on "keyword"-
style grammars, such as are used in classical Algol-like control
structures (IF/THEN/ELSE/FI, etc.), but not so well for things like
complex arithmetic expressions.

Fortunately, when building small ad-hoc parsers (such as the one you
[Bland] seem to be working on), there is a very useful hybrid parsing
technique which I first encountered in the BLISS-10 compiler [also
see Wulf, et al, "The Design of an Optimizing Compiler", on BLISS-11],
and have used with great success myself in several small projects:
Combine a recursive descent parser for control structures (and other
"large" features) with a simple-operator-precedence parser for arithmetic
expressions. The way you hook these together is to make the LL(1)
"keywords" be "operators" with a *very* high precedence (higher than
any normal arithmetic operator), so that when you encounter one of
them you always try to reduce it first. Since the recursive descent
routines call the simple-operator-precedence parser to parse any
subexpressions, the two styles nicely weave back & forth between
themselves.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: William Bland
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <pan.2004.06.23.04.18.30.640182@abstractnonsense.com>
On Tue, 22 Jun 2004 19:06:27 -0500, Rob Warnock wrote:
> That said, recursive descent parsers work *very* well on "keyword"-
> style grammars, such as are used in classical Algol-like control
> structures (IF/THEN/ELSE/FI, etc.), but not so well for things like
> complex arithmetic expressions.

Thanks for all the interesting replies.  Fortunately my parser is just for
reading configuration files, which will pretty much just have a
"name = value" format (where the value could be a string, a number or a
list, and lists are allowed to contain strings, numbers, and other
lists). I don't think I need anything more heavyweight than the parser I
have planned for this (although as several people have suggested, I will
separate out the lexer), but it's great to know of these techniques in
case I do ever need them.

Thanks again,
		Bill.
-- 
Dr. William Bland.
It would not be too unfair to any language to refer to Java as a
stripped down Lisp or Smalltalk with a C syntax.   (Ken Anderson).
From: Erann Gat
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <gNOSPAMat-3C6401.15004623062004@nntp1.jpl.nasa.gov>
In article <······················@speakeasy.net>,
 ····@rpw3.org (Rob Warnock) wrote:

> William D Clinger <··········@verizon.net> wrote:
> +---------------
> | William Bland asked:
> | > 2) is this what people mean when they say "recursive descent parser"?
> | 
> | A recursive descent parser is an LL(1) parser that's implemented
> | along the lines you gave.  A parser that's implemented along those
> | lines but isn't LL(1) is not a recursive descent parser.  It might
> | be recursive descent with backup, or even worse, but it isn't a
> | recursive descent parser.
> +---------------
> 
> To expand a little (though rather informally and not very precisely)
> on what Will said, an LL(1) grammar is one in which the next production
> (non-terminal) to be reduced can be determined just from looking at the
> next terminal symbol encountered in the input stream. This means that
> it works very much better if you have a separate lexical analyzer in
> front of it so that the LL(1) parser is looking at "tokens", not raw
> characters.
> 
> That said, recursive descent parsers work *very* well on "keyword"-
> style grammars, such as are used in classical Algol-like control
> structures (IF/THEN/ELSE/FI, etc.), but not so well for things like
> complex arithmetic expressions.
> 
> Fortunately, when building small ad-hoc parsers (such as the one you
> [Bland] seem to be working on), there is a very useful hybrid parsing
> technique which I first encountered in the BLISS-10 compiler [also
> see Wulf, et al, "The Design of an Optimizing Compiler", on BLISS-11],
> and have used with great success myself in several small projects:
> Combine a recursive descent parser for control structures (and other
> "large" features) with a simple-operator-precedence parser for arithmetic
> expressions. The way you hook these together is to make the LL(1)
> "keywords" be "operators" with a *very* high precedence (higher than
> any normal arithmetic operator), so that when you encounter one of
> them you always try to reduce it first. Since the recursive descent
> routines call the simple-operator-precedence parser to parse any
> subexpressions, the two styles nicely weave back & forth between
> themselves.

Google on "parcil" to find a parser for C syntax written in Lisp in 
exactly this way.

E.
From: Thomas Schilling
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <opr90okzy7trs3c0@news.CIS.DFN.DE>
William D Clinger wrote:

> This distinction is important because recursive descent is one of
> the very best parsing algorithms, while recursive descent with backup
> is one of the worst parsing algorithms.  People who don't understand
> any parsing theory at all often get them confused, with potentially
> dire consequences.

I heard of different opinions. While parsers for LL(1) are surely powerful 
for practical purposes it may not be that bad to have LL(2) or LL(3) 
parsers, since only the _worst case_ is bad while the average isn't that 
bad (with modern hardware).

The problem with LL(1) is that you often have to twist your grammar quite 
a lot (I'm not sure if it's even possible for every kind of grammar-- I 
guess not.) while LARL(1) is, in my experience (with bison), quite bad for 
error recovering.

-ts
-- 
      ,,
     \../   /  <<< The LISP Effect
    |_\\ _==__
__ | |bb|   | _________________________________________________
From: Joe Marshall
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <brjbw6sn.fsf@comcast.net>
Thomas Schilling <······@yahoo.de> writes:

> William D Clinger wrote:
>
>> This distinction is important because recursive descent is one of
>> the very best parsing algorithms, while recursive descent with backup
>> is one of the worst parsing algorithms.  People who don't understand
>> any parsing theory at all often get them confused, with potentially
>> dire consequences.
>
> I heard of different opinions. While parsers for LL(1) are surely
> powerful for practical purposes it may not be that bad to have LL(2)
> or LL(3) parsers, since only the _worst case_ is bad while the average
> isn't that bad (with modern hardware).
>
> The problem with LL(1) is that you often have to twist your grammar
> quite a lot (I'm not sure if it's even possible for every kind of
> grammar-- I guess not.) while LARL(1) is, in my experience (with
> bison), quite bad for error recovering.

Of course you *could* just write things in Cambridge Polish notation
and finesse the entire issue.

-- 
~jrm
From: Thomas Schilling
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <opr90qwjg1trs3c0@news.CIS.DFN.DE>
Joe Marshall wrote:

>> The problem with LL(1) is that you often have to twist your grammar
>> quite a lot (I'm not sure if it's even possible for every kind of
>> grammar-- I guess not.) while LARL(1) is, in my experience (with
>> bison), quite bad for error recovering.
>
> Of course you *could* just write things in Cambridge Polish notation
> and finesse the entire issue.

Unfortunately the lisp reader is quite complex too cause the rules for 
distinction of symbols and non-symbols (i.e. numbers) aren't LL(1) too I 
think. OTOH that's the lexer-part, though.

-ts
-- 
      ,,
     \../   /  <<< The LISP Effect
    |_\\ _==__
__ | |bb|   | _________________________________________________
From: William D Clinger
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <fb74251e.0406241323.7d7af77c@posting.google.com>
Thomas Schilling <······@yahoo.de> wrote:
> I heard of different opinions. While parsers for LL(1) are surely powerful 
> for practical purposes it may not be that bad to have LL(2) or LL(3) 
> parsers, since only the _worst case_ is bad while the average isn't that 
> bad (with modern hardware).
> 
> The problem with LL(1) is that you often have to twist your grammar quite 
> a lot (I'm not sure if it's even possible for every kind of grammar-- I 
> guess not.) while LARL(1) is, in my experience (with bison), quite bad for 
> error recovering.

You're absolutely right.  I should have said LL(k), not LL(1).
Sorry about that.

Will
From: Antony Blakey
Subject: Re: OT:  Recursive descent parsers
Date: 
Message-ID: <cb8hf8$924$1@lust.ihug.co.nz>
William Bland wrote:
> Yet another rather off topic question, posted here because Lisp hackers
> know everything ;-)
> 
> I missed out on a formal computing education, as I was busy doing pure
> mathematics, so I never learned about parsers.  This has become important
> recently because I need to write one at work (unfortunately in Java rather
> than Lisp, where it'd be much easier). I already know how to write it, and
> I *think* it's what you'd call a "recursive descent parser", but every
> time I try to read an article on parsers I get bogged down in the BNF,
> which I don't find easy to read at all.
> 
> Basically the structure I'm thinking of for my parser is a top-level
> function that dispatches on the first character it reads, and calls a
> bunch of helper functions, something like:
> 
> (defun read-stuff ()
>   (let ((c (peek-char)))
>     (cond ((char= c #\Space)
> 	   (read-char)
> 	   (read-stuff))
> 	  ((char= c #\()
> 	   (read-char)
> 	   (read-list))
> 	  ((digit-char-p c)
> 	   (read-number))
> 	  (t
> 	   "ERROR!"))))
> 
> (defun read-list ()
>   (let ((c (peek-char)))
>     (cond ((char= c #\))
> 	   (read-char)
> 	   nil)
> 	  (t
> 	   (cons (read-stuff) (read-list))))))
> 
> (defun read-number ()
>   (read-number-helper 0))
> 
> (defun read-number-helper (acc)
>   (let ((digit (- (char-code (read-char))
> 		  (char-code #\0))))
>     (setf acc (+ (* 10 acc) digit))
>     (if (digit-char-p (peek-char))
> 	(read-number-helper acc)
>         acc)))
> 
> but, of course, I'll have to write it in Java (yuck!).  I write parsers
> this way just because I can't imagine any other way to write them (well, I
> suppose I can, but they wouldn't be pretty).  I'd be interested to know
> though:
> 
> 1) is this a "normal" way to write a parser?  I've seen similar things in
> Lisp books, so I know I'm not the *only* person who writes them this way,
> but do other people write them like this (e.g. when people write parsers
> for languages like Java)?

I usually use a parser generator - in java that would be JavaCC or 
Antlr, both of which are free. It is vastly easier to master BNF and use 
a tool than do the whole thing yourself unless the language you are 
trying to parse is tiny. Just handling parse errors in a meaningfull way 
can be a pain in the arse when doing it manually.

Both JavaCC and Antlr come with examples and IMO are very easy to use.

> 2) is this what people mean when they say "recursive descent parser"?

Yes, although usually the process is split into two components - a lexer 
and a parser. The lexer processes the characters, producing tokens, and 
the parser process tokens. There is no technical reason that you can't 
parse directly from the character stream, but in my experience it's 
easier not too if you do it by hand. Which I no longer do.

Recursive descent is only useful for a certain class of grammars. More 
powerful techniques exist, but recursive descent is the only method that 
you can write by hand, and it's far easier to debug because both the 
static and dynamic structure of the code maps onto the structure of the 
grammar in an intuitive way.

BTW there are quite a few good books on this topic - for your purposes 
maybe 'Building Parsers With Java - Steven Metsker', or for a more 
technical presentation (of more than parsing) 'Modern Compiler 
Implementation In Java - Andrew Appel'. The classic book is the Dragon 
Book - "Compilers:  Principles, Techniques and Tools", by Alfred V. Aho, 
Ravi Sethi,  and Jeffrey D. Ullman. There are many others but unless 
you're really interested in compiler construction they aren't relevant.