From: Andy Chambers
Subject: lexer for cl-yacc
Date: 
Message-ID: <df196b8c-0426-4df5-9e77-d5b986fe7368@s12g2000prg.googlegroups.com>
Hi,

I'm playing around with cl-yacc to see if I can parse some xpath
expressions.  How does one decide where the job of the lexer ends and
that of the parser begins?  Initially, I made my lexer recognize
words, literals, symbols and some keywords using the dso-lex package,
but I'm starting to think I should leave most of that to the parser
itself and just use the lexer to pick out special symbols like
"[ ] /" et al.

Cheers,
Andy

From: Pascal Bourguignon
Subject: Re: lexer for cl-yacc
Date: 
Message-ID: <87ejc4tws8.fsf@thalassa.informatimago.com>
Andy Chambers <··············@googlemail.com> writes:
> I'm playing around with cl-yacc to see if I can parse some xpath
> expressions.  How does one decide where the job of the lexer ends and
> that of the parser begins.

Art.


> Initially, I made my lexer recognize
> words, literals, symbols and some keywords using the dso-lex package
> but I'm starting to think I should leave most of that to the parser
> itself and just use the lexer to highlight special symbols like
> "[ ] /" et al.

There is usually a difference in power between the languages a lexer
can recognize and those a parser can.  It is logical to use the lexer
to scan everything that can be scanned with a DFA (i.e. a regular
expression), and to use the parser only for what needs a full
push-down automaton.

Your first impulse was right.
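
With cl-yacc specifically, the split is concrete: the lexer is just a
function of no arguments that returns two values, a terminal symbol
and a semantic value (and NIL at end of input), and the parser built
by DEFINE-PARSER consumes those terminals via PARSE-WITH-LEXER.  Here
is a minimal hand-rolled sketch for a tiny XPath-ish subset (names,
"/", "[", "]"), assuming the YACC package from cl-yacc is loaded; it
skips dso-lex, whitespace and error handling, and all the names are
made up for illustration:

  (defun make-xpath-lexer (string)
    (let ((pos 0)
          (len (length string)))
      (lambda ()
        (if (>= pos len)
            (values nil nil)                 ; end of input
            (let ((c (char string pos)))
              (case c
                (#\/ (incf pos) (values '/ "/"))
                (#\[ (incf pos) (values '[ "["))
                (#\] (incf pos) (values '] "]"))
                (otherwise                   ; a run of "name" characters
                 (let ((start pos))
                   (loop while (and (< pos len)
                                    (not (member (char string pos)
                                                 '(#\/ #\[ #\]))))
                         do (incf pos))
                   (values 'name (subseq string start pos))))))))))

  (yacc:define-parser *xpath-parser*
    (:start-symbol path)
    (:terminals (name / [ ]))
    (path step
          (path / step))
    (step name
          (name [ path ])))

  ;; (yacc:parse-with-lexer (make-xpath-lexer "a/b[c]") *xpath-parser*)

The point is that everything regular (the name token) is decided in
the lexer, and only the recursive structure (paths inside predicates)
is left to the grammar.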

In the usual context-free grammar there is still a lot of
non-recursive stuff that could actually be scanned with a lexer,
foremost in non-sexp-based languages.  For example, in C, the
parentheses after if or while are meaningless; they could be scanned
by defining the tokens to be /if *(/ and /while *(/ instead of /if/,
/while/ and /(/.  But all right, we don't do that, because it's more
meaningful for our brains to parse them: if_stmt ::= /if/ /(/
expression /)/ stmt [ /else/ stmt ] .  In Lisp we usually don't care
for this cruft, and just write: if_stmt ::= /if/ expression
expression [ expression ] .
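
Transcribed into cl-yacc rule syntax (a sketch with hypothetical
terminal and non-terminal names, actions omitted; these fragments
would sit inside a DEFINE-PARSER form), the two shapes look like:

  ;; C-style: the parentheses are part of the statement grammar.
  (if-stmt (if |(| expression |)| statement)
           (if |(| expression |)| statement else statement))

  ;; Lisp-style: no parenthesis cruft in the grammar at all.
  (if-form (if expression expression)
           (if expression expression expression))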

Another reason why you may want to scan /if/ separately from /(/ is
that it allows the language designer to change the syntax more
easily.  If you want to add a /when/ or an /unless/ keyword, it's
more modular to do so if you don't have to consider all the sugary
syntactic contexts in which they may appear.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

HEALTH WARNING: Care should be taken when lifting this product,
since its mass, and thus its weight, is dependent on its velocity
relative to the user.
From: Maciej Katafiasz
Subject: Re: lexer for cl-yacc
Date: 
Message-ID: <fng03r$9u2$1@news.net.uni-c.dk>
On Sat, 26 Jan 2008 16:46:15 +0100, Pascal Bourguignon wrote:

>> Initially, I made my lexer recognize
>> words, literals, symbols and some keywords using the dso-lex package
>> but I'm starting to think I should leave most of that to the parser
>> itself and just use the lexer to highlight special symbols like "[ ] /"
>> et al.
> 
> There is usually a difference of power in the language a lexer can
> recognize, from that of a parser.  It would be logical to use the lexer
> to scan all that can be scanned with a DFA = regular expression, and to
> use the parser only to parse what needs a full push-down automata.
> 
> Your first impulse was right.

Not to mention that, within sensible boundaries, offloading work to the 
lexer is a win, since its rules are much simpler and much more implicit, 
so a given bit of recognition is much less work there than it would be 
to define within the parser.  For instance, if you make "while" lex into 
KEYWORD_WHILE, you can write your parser rules directly in terms of 
KEYWORD_WHILE.  Whereas if you lexed it as a generic WORD, you'd have to 
check for WORD == "while" at every spot where "while" can occur.  So 
unless you want something like mutable keywords, there's work to be 
saved by recognizing keywords early.
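
In cl-yacc terms (a sketch with hypothetical names, not anyone's
actual code), that classification happens once, on the lexer side:

  ;; Lexer side: classify the word once, as it is scanned.
  (defun word->terminal (word)
    (if (string= word "while")
        (values 'keyword-while word)   ; recognized as a keyword here
        (values 'word word)))          ; everything else stays a plain WORD

  ;; Parser side: a rule can then name the keyword directly, e.g.
  ;;   (while-stmt (keyword-while expression statement))
  ;; instead of matching WORD and re-checking its value in every action.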

Cheers,
Maciej

PS. Historically, that's exactly how the two-stage parsing architecture 
came to be.  People noticed that whilst context-free parsers can express 
everything finite automata can, being mathematically more powerful, 
there's a difference in the ease of doing so.  So they split out the 
simpler lexer stage and used the simpler language for it.  Oh, and the 
development and standardisation of concepts such as what a character is 
helped too; if you read old books on parsing, you'll find them concerned 
with topics such as the transformation of "b^H_" -> "underlined b" (for 
old languages tended to distinguish keywords in such ways), or 
reconciling the representation of source text input on punched cards 
with that input on tapes.