From: Frank Buss
Subject: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <95r7mz11qtr4$.13peku2oi20ok$.dlg@40tude.net>
I'm trying to write a C parser with cl-yacc. This is the beginning:

http://www.frank-buss.de/tmp/parser.lisp.txt

For compiling, the ASDF packages lexer, regex and yacc from the lispbuilder
project are needed:

http://sourceforge.net/project/showfiles.php?group_id=159740

But when compiling it, there are 96 Shift/Reduce and 68 Reduce/Reduce
conflicts. I've used the grammar from 
http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

I've added some precedence rules, but I still get the same number of conflicts.

Parsing simple C files doesn't work, it says:

Error: Unexpected terminal YACC-EOF-SYMBOL (value NIL)
Expected one of: (IDENTIFIER \( * AUTO EXTERN TYPEDEF STATIC REGISTER ENUM
UNSIGNED DOUBLE LONG SHORT VOID CHAR INT FLOAT SIGNED STRUCT UNION CONST
VOLATILE)

I wonder if the conflicts are a problem with the grammar or a problem with
CL-YACC. How can I fix it?

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

From: Brian Downing
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <MyPLf.578764$084.76186@attbi_s22>
In article <································@40tude.net>,
Frank Buss  <··@frank-buss.de> wrote:
> http://www.frank-buss.de/tmp/parser.lisp.txt
> 
> But when compiling it, there are 96 Shift/Reduce and 68 Reduce/Reduce
> conflicts. I've used the grammar from 
> http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
> 
> I've added some precedence rules, but still the same number of conflicts.
> 
> Parsing simple C files doesn't work, it says:
> 
> Error: Unexpected terminal YACC-EOF-SYMBOL (value NIL)
> Expected one of: (IDENTIFIER \( * AUTO EXTERN TYPEDEF STATIC REGISTER ENUM
> UNSIGNED DOUBLE LONG SHORT VOID CHAR INT FLOAT SIGNED STRUCT UNION CONST
> VOLATILE)
> 
> I wonder if the conflicts are a problem of the grammar or a problem of
> CL-YACC? How I can fix it?

I mechanically converted the same C grammar you started with (by writing
a YACC grammar for CL-YACC and parsing the C grammar), and it works
fine.  It's likely you've made a mistake somewhere.

(For values of "fine" that include "produces an almost useless parse
tree." I was very disappointed by how much work after parsing is
required to even begin to understand declaration syntax.)

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <s2h6fsywjnl6$.g1yxtbcod9de.dlg@40tude.net>
Brian Downing wrote:

> I mechanically converted the same C grammar you started with (by writing
> a YACC grammar for CL-YACC and parsing the C grammar), and it works
> fine.  It's likely you've made a mistake somewhere.

You are right, my lexer didn't work; now it does, and I've updated
http://www.frank-buss.de/tmp/parser.lisp.txt
There are still many conflicts, but with "(:muffle-conflicts :some)" it
doesn't list every one of them when compiling.

Now I can parse this file:


void t(int x) {
	printf("number: %i", x);
}

int main(int argc, char** argv) {
	t(10);
}


to this internal representation:


((VOID ("t" \( (INT "x") \))
       ({ (("printf" \( ("\"number: %i\"" \, "x") \)) \;) }))
 (INT ("main" \( ((INT "argc") \, (CHAR ((* *) "argv"))) \))
      ({ ((("t" \( 10 \)) \;)
          (RETURN 1 \;)) })))

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Brian Downing
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <n%0Mf.579738$084.333575@attbi_s22>
In article <······························@40tude.net>,
Frank Buss  <··@frank-buss.de> wrote:
> Brian Downing wrote:
> > I mechanically converted the same C grammar you started with (by writing
> > a YACC grammar for CL-YACC and parsing the C grammar), and it works
> > fine.  It's likely you've made a mistake somewhere.
> 
> You are right, my lexer didn't worked, now it works, I've updated 
> http://www.frank-buss.de/tmp/parser.lisp.txt
> There are still many conflicts, but with a "(:muffle-conflicts :some)" it
> doesn't list every one when compiling.

When your grammar is correct you should only have one shift/reduce
conflict (the classic if-else one).
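
For reference, here is a toy grammar that reproduces just that conflict.
This is only a sketch to make the point concrete (not taken from the
converted C grammar), using define-parser as it appears elsewhere in this
thread; compiling it should report exactly one shift/reduce conflict:

(define-parser *if-else-demo*
  (:start-symbol statement)
  (:terminals (IF ELSE EXPR OTHER))
  (statement
   OTHER
   (IF EXPR statement)
   (IF EXPR statement ELSE statement)))

On input equivalent to "if e1 if e2 other else other" the parser can either
shift ELSE (attaching it to the inner IF, as C requires) or reduce the inner
IF statement first; LALR generators conventionally resolve this conflict by
shifting, which gives the C behaviour.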

> Now I can parse this file:
> 
> void t(int x) {
> 	printf("number: %i", x);
> }
> 
> int main(int argc, char** argv) {
> 	t(10);
> }
> 
> to this internal representation:
> 
> 
> ((VOID ("t" \( (INT "x") \))
>        ({ (("printf" \( ("\"number: %i\"" \, "x") \)) \;) }))
>  (INT ("main" \( ((INT "argc") \, (CHAR ((* *) "argv"))) \))
>       ({ ((("t" \( 10 \)) \;)
>           (RETURN 1 \;)) })))

I get:

((DEFUN ((DECLARATION-SPECIFIERS void)
         (DECLARATOR
          (t ((PARAMETER (DECLARATION-SPECIFIERS int) (DECLARATOR x))))))
        (PROGN (printf "number: %i" x)))
 (DEFUN ((DECLARATION-SPECIFIERS int)
         (DECLARATOR
          (main
           ((PARAMETER (DECLARATION-SPECIFIERS int) (DECLARATOR argc))
            (PARAMETER (DECLARATION-SPECIFIERS char) (DECLARATOR * * argv))))))
        (PROGN (t 10))))

...but I spent a while lispifying and cleaning up the output of the
grammar.  (The lower case "symbols" are tokens printed in brief form -
their unprintable representation is something like #<TOKEN IDENTIFIER
"argc" (4,13)>.)

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <87k6bjlh8j.fsf@qrnik.zagroda>
Frank Buss <··@frank-buss.de> writes:

> I'm trying to write a C parser with cl-yacc. This is the beginning:
>
> http://www.frank-buss.de/tmp/parser.lisp.txt

I was wondering how you distinguish type names from other
identifiers (they need information from later stages of processing),
but it seems that you don't. Aren't you confusing type_name with
TYPE_NAME?

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <2rca1924hjtp$.12ewajtq86c2c.dlg@40tude.net>
Marcin 'Qrczak' Kowalczyk wrote:

> I was wondering how do you distinguish type names from other
> identifiers (they need information from later stages of processing),
> but it seems that you don't.

Yes, I have to implement this step later. I'm planning to use a context
object while parsing, which has a hash table with all types (and macros for
the preprocessor etc.).

> Aren't you confusing type_name with TYPE_NAME?

Thanks, this was a subtle bug: the Lisp reader is not case sensitive,
unlike the C parser. Now there are only 3 shift/reduce and 7 reduce/reduce
conflicts, and I've updated the upload. Do you know if the C grammar is
conflict-free at all, or is there another bug in my implementation of the
grammar?

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Kaz Kylheku
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1140889844.769683.100760@i39g2000cwa.googlegroups.com>
Frank Buss wrote:

> Thanks, this was a subtle bug, because the Lisp reader is not case
> sensitive, like the C parser, now it has only 3 Shift/Reduce and 7
> Reduce/Reduce conflicts, I've update the upload. Do you know if the C
> grammar is conflict free at all, or is there another bug in my
> implementation of the grammar?

The C grammar is not only not conflict free (in the context of building
a LALR(1) table-driven parser) but it's actually ambiguous, and as such
not LR(k) at all.

There are situations in the C grammar where the lexical category of
something must be resolved using semantic information. Did you handle
these in your lexer?

What is this?

  {
    x(y); /* function call? */
  }

What if we add:

  typedef int x;
  {
    x(y); /* declare y to be of type x. */
  }

Now x is a declaration specifier list and (y) is a declarator. Yes,
parentheses are allowed around an entire declarator, not just within it
as in (*f)(int arg) or (*ptrarray)[10].
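
One common workaround (the usual "lexer hack") is to have the parser's
semantic actions record typedef names and to have the lexer consult that
table before deciding on a token class. A minimal sketch in plain CL, not
tied to any particular lexer package:

;; Table of typedef names, filled in by the semantic action that reduces
;; a typedef declaration and consulted by the lexer for every identifier.
(defparameter *typedef-names* (make-hash-table :test 'equal))

(defun classify-identifier (name)
  "Return TYPE_NAME for names declared via typedef, IDENTIFIER otherwise."
  (if (gethash name *typedef-names*)
      (values 'TYPE_NAME name)
      (values 'IDENTIFIER name)))

;; After reducing "typedef int x;" the action would do
;;   (setf (gethash "x" *typedef-names*) t)
;; so that a later "x(y);" lexes x as TYPE_NAME and parses as a declaration.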

(Of course, Common Lisp parsing is also driven by semantics. You don't
know what the syntax is unless you know the operator. But at least by
the time that question arises, you are working with nice trees).

Also, what are you doing for preprocessing?
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <15pcrrlw3sdco$.16dgrdzto6cqh.dlg@40tude.net>
Kaz Kylheku wrote:

> Also, what are you doing for preprocessing?

I've tried to write a lexer and a parser for it, but it's getting
complicated. I think it is easier to do it in small steps:

1. trigraph substitution (can be disabled)
2. merge escaped newlines to single lines (see the sketch after this list)
3. substitute comments with a single whitespace character
4. read line by line, executing #directives and substituting macros
5. C parser
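
A minimal sketch of step 2 in plain CL, with no library assumptions, just
to make the idea concrete:

(defun splice-lines (text)
  "Remove backslash-newline sequences, merging continued lines (step 2)."
  (with-output-to-string (out)
    (let ((i 0) (n (length text)))
      (loop while (< i n) do
            (if (and (char= (char text i) #\\)
                     (< (1+ i) n)
                     (char= (char text (1+ i)) #\Newline))
                (incf i 2)    ; drop the backslash and the newline
              (progn
                (write-char (char text i) out)
                (incf i)))))))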

Step 4 could be a simple Lisp function, with some help from an expression
parser and the lispbuilder-regex package. The expression parser is needed
because the preprocessor can handle nearly all C operators, e.g.
#define foo 2
#define x !(2-0xdl+2)?99:046+(1<<foo)
#if x==42
	printf("hallo");
#endif hello

compiles the printf line (at least with Visual C++). I've written an
expression parser and it works (see below, no conflicts :-), but I don't
like the global variable. Any idea how I can pass a hash table to the
parse-expression function and access it from the parser without a global
variable?

(defpackage #:preprocessor
  (:use :lispbuilder-regex :lispbuilder-lexer :lispbuilder-yacc :cl))

(in-package #:preprocessor)

;; a modified subset of http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

(deflexer expression-lexer
  (">>" (return '|>>|))
  ("<<" (return '|<<|))
  ("&&" (return '\&\&))
  ("\\|\\|" (return '\|\|))
  ("<=" (return '|<=|))
  (">=" (return '|>=|))
  ("==" (return '|==|))
  ("!=" (return '|!=|))
  (":" (return '\:))
  ("\\(" (return '|(|))
  ("\\)" (return '|)|))
  ("&" (return '\&))
  ("!" (return '|!|))
  ("~" (return '|~|))
  ("-" (return '|-|))
  ("\\+" (return '|+|))
  ("\\*" (return '|*|))
  ("/" (return '|/|))
  ("%" (return '|%|))
  ("<" (return '|<|))
  (">" (return '|>|))
  ("\\^" (return '|^|))
  ("\\|" (return '\|))
  ("\\?" (return '|?|))
  ("\\(" (return '|(|))
  ("\\)" (return '|)|))

  ("([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))

  ("0[xX]([a-fA-F0-9]+)(u|U|l|L)?"
   (return (values 'CONSTANT (parse-integer %1 :radix 16))))
  ("0([:digit:]+)(u|U|l|L)?"
   (return (values 'CONSTANT (parse-integer %1 :radix 8))))
  ("([:digit:]+)(u|U|l|L)?"
   (return (values 'CONSTANT (int %1))))
  ("([:alpha:]|_)?'(\\.|[^\\'])+'"
   (return (values 'CONSTANT (int %0))))

  ("([:digit:]+[Ee][+-]?[:digit:])+(f|F|l|L)?"
   (return (values 'CONSTANT (num %1))))
  ("([:digit:]*\\.[:digit:]+[Ee][+-]?[:digit:]+)(f|F|l|L)?"
   (return (values 'CONSTANT (num %1))))
  ("([:digit:]+\\.[:digit:]*[Ee][+-]?[:digit:]+)(f|F|l|L)?"
   (return (values 'CONSTANT (num %1))))

  ("[:space:]*"))

;; a modified subset of http://www.lysator.liu.se/c/ANSI-C-grammar-y.html

(defparameter *identifiers* (make-hash-table :test 'equal))

(define-parser expression-parser
  (:start-symbol expression)

  (:terminals
   (IDENTIFIER CONSTANT |>>| |<<| \&\& \|\| |<=| |>=| |==| |!=|
               \:  |(| |)| |[| |]| \& |!| |~| |-| |+| |*| |/| |%|
               |<| |>| |^| \| |?| |(| |)|))

  (expression
   logical-or-expression
   (logical-or-expression |?| expression \: expression
                          #'(lambda (a op1 b op2 c) (if a b c))))

  (logical-or-expression
   logical-and-expression
   (logical-or-expression \|\| logical-and-expression
                          #'(lambda (a op b) (logior a b))))

  (logical-and-expression
   inclusive-or-expression
   (logical-and-expression \&\& inclusive-or-expression
                           #'(lambda (a op b) (logand a b))))

  (inclusive-or-expression
   exclusive-or-expression
   (inclusive-or-expression \| exclusive-or-expression
                            #'(lambda (a op b) (bit-ior a b))))

  (exclusive-or-expression
   and-expression
   (exclusive-or-expression |^| and-expression
                            #'(lambda (a op b) (bit-xor a b))))

  (and-expression
   equality-expression
   (and-expression \& equality-expression
                   #'(lambda (a op b) (bit-and a b))))

  (equality-expression
   relational-expression
   (equality-expression |==| relational-expression
                        #'(lambda (a op b) (eql a b)))
   (equality-expression |!=| relational-expression
                        #'(lambda (a op b) (not (eql a b)))))

  (relational-expression
   shift-expression
   (relational-expression |<| shift-expression
                          #'(lambda (a op b) (< a b)))
   (relational-expression |>| shift-expression
                          #'(lambda (a op b) (> a b)))
   (relational-expression |<=| shift-expression
                          #'(lambda (a op b) (<= a b)))
   (relational-expression |>=| shift-expression
                          #'(lambda (a op b) (>= a b))))

  (shift-expression
   additive-expression
   (shift-expression |<<| additive-expression
                     #'(lambda (a op b) (ash a b)))
   (shift-expression |>>| additive-expression
                     #'(lambda (a op b) (ash a (- b)))))

  (additive-expression
   multiplicative-expression
   (additive-expression |+| multiplicative-expression
                        #'(lambda (a op b) (+ a b)))
   (additive-expression |-| multiplicative-expression
                        #'(lambda (a op b) (- a b))))

  (multiplicative-expression
   unary-expression
   (multiplicative-expression |*| unary-expression
                              #'(lambda (a op b) (* a b)))
   (multiplicative-expression |/| unary-expression
                              #'(lambda (a op b) (/ a b)))
   (multiplicative-expression |%| unary-expression
                              #'(lambda (a op b) (mod a b))))

  (unary-expression
   value
   (|+| unary-expression #'(lambda (op a) a))
   (|-| unary-expression #'(lambda (op a) (- a)))
   (|~| unary-expression #'(lambda (op a) (bit-not a)))
   (|!| unary-expression #'(lambda (op a) (not a))))

  (value
   (|(| expression |)| #'(lambda (op a op) a))
   CONSTANT
   (IDENTIFIER #'(lambda (name) (gethash name *identifiers*)))))

(defun parse-expression (string)
  (parse-with-lexer (expression-lexer string) expression-parser))

(defun lex-expression (string)
  (loop with lexer = (expression-lexer string) do
        (multiple-value-bind (token value) (funcall lexer)
          (if token
              (format t "t: ~a v: ~a~%" token value)
            (loop-finish)))))

(defparameter *test-expression* "!(2-0xdl+2)?99:046+(1<<foo)")

(defun test-parser ()
  (setf (gethash "foo" *identifiers*) 2)
  (parse-expression *test-expression*))

(defun test-lexer ()
  (lex-expression *test-expression*))
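
Regarding the global-variable question above: one option, relying only on
standard CL dynamic binding and nothing cl-yacc-specific, is to keep
*identifiers* a special variable but rebind it around each parse, so every
caller supplies its own table (parse-expression-with-table is a hypothetical
name, just a sketch):

(defun parse-expression-with-table (string table)
  "Like parse-expression, but the caller supplies the identifier table.
The LET establishes a dynamic binding of *identifiers*, which the
IDENTIFIER action above sees for the duration of this parse only."
  (let ((*identifiers* table))
    (parse-with-lexer (expression-lexer string) expression-parser)))

;; e.g. (let ((h (make-hash-table :test 'equal)))
;;        (setf (gethash "foo" h) 2)
;;        (parse-expression-with-table "1<<foo" h))   ; should give 4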

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: senator
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1141084828.939179.28120@p10g2000cwp.googlegroups.com>
Hi Frank,

I am also using the lexer+regex+clawk (clawk unused) packages along with
cl-yacc. I think there's an error in your C lexer, if you haven't
changed it in the meantime (I haven't kept up with the latest releases
of the (newly rejuvenated?) package). I am referring to:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun double-values (symbol)
    `(return (values (quote ,symbol) (quote ,symbol))))

  (defun c-keyword (symbol)
    `(,(string-downcase (string symbol)) ,(double-values symbol))))

(deflexer c-lexer
  #.(c-keyword 'AUTO)
  #.(c-keyword 'BREAK)
  #.(c-keyword 'CASE)
  #.(c-keyword 'CHAR)
  #.(c-keyword 'CONST)
  #.(c-keyword 'CONTINUE)
  #.(c-keyword 'DEFAULT)
  ...... and so on.....
  "([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
  ...... and so on.....




This, if I'm not mistaken, gets converted to:
(deflexer c-lexer
  ("auto" (return (values 'AUTO 'AUTO)))
  ...... and so on.....
  "([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
  ...... and so on.....



I foresee problems with any identifier names that start with the same
prefixes as the "c-keyword" tokens. For example,

int iffy_variable, caseys_strength, intelligence;

Gets lexed into:
int if fy_variable comma case ys_strength comma int elligence endstmt

The way this should be done is that the lexer should allow additional
token classifications, so you can also specify those keywords.
Alternatively, there could be some way to make the lexing "non-greedy",
but this might clash with any greediness in the regexes specifying the
tokens... I don't know.

My workaround for now involves:

  (deflexer asplexer
    ...... other stuff.....
    ("[a-zA-Z][a-zA-Z0-9_]*" (return
				(test-case %0 #'string-equal
				    ("if" (values 'if 'if))
				    ("case" (values 'case 'case))
				    ...... and so on.....
				    (otherwise (values 'ident %0)))))
     ...... and so on.....

That seems to work (the version above has been cleaned up a bit, but
hasn't been tested in the REPL, so it might not run). test-case is my own
macro that takes a test function, so it can compare with string-equal,
for example.
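
A macro with the behaviour described here could be written like this (an
illustrative sketch, not necessarily the original):

(defmacro test-case (keyform test &body clauses)
  "CASE-like dispatch, but comparing with an arbitrary two-argument TEST."
  (let ((key (gensym "KEY"))
        (fn (gensym "TEST")))
    `(let ((,key ,keyform)
           (,fn ,test))
       (cond ,@(loop for (value . body) in clauses
                     collect (if (eq value 'otherwise)
                                 `(t ,@body)
                                 `((funcall ,fn ,key ,value) ,@body)))))))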

Just for anyone interested, some of my thoughts on cl-yacc. I've been
thinking about what it takes to support EBNF. Not sure if this is already
supported, but I have a hack like this for * forms, easily extended to
+ or ? forms.

(defun star-expand (rule)
  (flet ((star-collect (x y)
	   (append x y)))
   (let ((rule* (gensym (strcat (string rule) "*"))))
     (values `(,rule*
	       ()
	       (,rule* ,rule)); ,#'star-collect
	     rule*))))

(Again: undebugged and not used in a while, caveat emptor.) A lot of work
is needed here (for example, those gensyms might make working inside the
debugger harder), but even more work is needed elsewhere first. More
importantly, I think something that looks like this would be better:

Instead of (from the examples page
http://www.pps.jussieu.fr/~jch/software/cl-yacc/cl-yacc.html)

(define-parser *expression-parser*
  (:start-symbol expression)
  (:terminals (int id + - * / |(| |)|))
  (:precedence ((:left * /) (:left + -)))

  (expression
   (expression + expression #'i2p)
   (expression - expression #'i2p)
   (expression * expression #'i2p)
   (expression / expression #'i2p)
   term)
   ....... etc .........

We can do
(define-parser *expression-parser*
  (:start-symbol expression)
  (:terminals (int id + - * / |(| |)|))
  (:precedence ((:left * /) (:left + -)))

  (expression
   ((expression + expression)(+ %0 %2))
   ((expression - expression)(- %0 %2))
   ((expression * expression)(* %0 %2))
   ((expression / expression)(/ %0 %2))
   ((term) (%0)))
   ....... etc .........

or who knows, maybe even something like
((modifiers name lparen paramslist rparen)
 ('FUNCTION %name ('MODIFIERS %modifiers) ('PARAMS %paramslist)))

Haven't really thought much about it, but there should probably be
alternative escaping mechanisms if we need to actually evaluate the
forms in place of the "list-specification" forms. The advantage of this
form is that it shows exactly how your data is formed into lists, and
is the most frequently used (for me anyway) pattern here. This is
especially convenient for list nesting, splicing, concatenation etc. I
might get around to this some day. For now, it is fermenting in my
thoughts. It will be interesting for me to check out that Antlr syntax
again; there might be more good stuff in there we can use...

Cheers, and good luck with the parsing.
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <yqt3o5ej9i31.1vxz3fbfvescz.dlg@40tude.net>
senator wrote:

> This, if I'm not mistaken, get's converted to:
> (deflexer c-lexer
>   ("auto" (return (values 'AUTO 'AUTO)))
>   ...... and so on.....
>   "([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
>   ...... and so on.....
> 
> 
> 
> I forsee problems with any identifier names that start with the same
> prefices as the "c-keyword" tokens. For example,
> 
> int iffy_variable, caseys_strength, intelligence;
> 
> Gets lexed into:
> int if fy_variable comma case ys_strength comma int elligence endstmt

I think this is a bug in the lexer. The "identifier" pattern should be
greedy, so it should match all of "intelligence". This fails:

(deflexer c-lexer
  ("int" (return (values 'INT 'INT)))
  ("([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
  ("[:space:]"))

(defun test ()
  (assert
      (equal
       '((INT . INT)
         (IDENTIFIER . "intelligence"))
       (loop with lexer = (c-lexer "int intelligence ") 
             with tokens = '()
             finally (return (nreverse tokens)) do
             (multiple-value-bind (token value) (funcall lexer)
               (if token
                   (push (cons token value) tokens)
                 (loop-finish)))))))

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <87slq3izhl.fsf@qrnik.zagroda>
Frank Buss <··@frank-buss.de> writes:

> I think this is a bug of the lexer. The "identifier" pattern should
> be greedy, so it should match all of "intelligence".

It's generally better to recognize identifiers+keywords with a lexing
pattern, and then distinguish them by looking them up in a dictionary.
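
In the deflexer style used elsewhere in this thread, that would look
roughly like this (a sketch assuming the %0 convention from Frank's code,
with the keyword symbols interned in the current package):

(defparameter *c-keywords* (make-hash-table :test 'equal))

(dolist (kw '("auto" "break" "case" "char" "const" "continue" "default"
              "do" "double" "else" "enum" "extern" "float" "for" "goto"
              "if" "int" "long" "register" "return" "short" "signed"
              "sizeof" "static" "struct" "switch" "typedef" "union"
              "unsigned" "void" "volatile" "while"))
  (setf (gethash kw *c-keywords*) (intern (string-upcase kw))))

(deflexer c-word-lexer
  ;; One rule matches keywords and identifiers; the table decides which.
  ("([:alpha:]|_)([:alnum:]|_)*"
   (return (let ((kw (gethash %0 *c-keywords*)))
             (if kw (values kw kw) (values 'IDENTIFIER %0)))))
  ("[:space:]"))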

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1fe1aer9yu12j.b5icx6fr0lfe$.dlg@40tude.net>
Marcin 'Qrczak' Kowalczyk wrote:

> It's generally better to recognize identifiers+keywords with a lexing
> pattern, and then distinguish them by looking them up in a dictionary.

Do you have a reason why I should look them up in a dictionary? I'll need
this for the user-defined TYPE_NAME token, but I don't see a reason why I
should use it for the normal keywords, if the lexer behaves like flex and
is greedy for all patterns in parallel.

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <87r75nlnrb.fsf@qrnik.zagroda>
Frank Buss <··@frank-buss.de> writes:

>> It's generally better to recognize identifiers+keywords with a lexing
>> pattern, and then distinguish them by looking them up in a dictionary.
>
> Do you have a reason why I should look them up in a dictionary?

It makes lexer tables smaller. In a typical implementation every letter
in every keyword, except common prefixes, induces a 256-element table
(I have no idea whether this is true in the lexer generator you are
using though).

It doesn't rely on rules about overlapping patterns in the particular
lexer generator. With keywords matched separately they are overlapping,
because we definitely don't want to write a true regexp for identifiers
which excludes keywords.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <uc5iuqrbhj6x$.voep3jrq3azs$.dlg@40tude.net>
Marcin 'Qrczak' Kowalczyk wrote:

> It doesn't rely on rules about overlapping patterns in the particular
> lexer generator. With keywords matched separately they are overlapping,
> because we definitely don't want to write a true regexp for identifiers
> which excludes keywords.

That's true, but for me it is easier to rely on the feature that
overlapping patterns are matched greedily, because otherwise I'll have more
problems. With the current lexer, numbers are not matched correctly. With
these patterns from my code:

  ("([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
  ("[:digit:]+(u|U|l|L)?" (return (values 'CONSTANT (int %0))))
  ("[:digit:]+[Ee][+-]?[:digit:]+(f|F|l|L)?" (return (values 'CONSTANT (num
%0))))

the number 1e20 is matched as "CONSTANT 1" and "IDENTIFIER e20". If I
understand the flex concept correctly, the longest match wins, and if two
patterns match the same length, the first pattern is used. This makes
things a lot easier.

I've written an email to the author of the lexer package; maybe he can
fix it.

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: senator
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1141165336.585490.289610@v46g2000cwv.googlegroups.com>
> problems. With the current lexer, numbers are not matched correctly. With
> these patterns from my code:
>
>   ("([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
>   ("[:digit:]+(u|U|l|L)?" (return (values 'CONSTANT (int %0))))
>   ("[:digit:]+[Ee][+-]?[:digit:]+(f|F|l|L)?" (return (values 'CONSTANT (num
> %0))))
>
> the number 1e20 is matched as "CONSTANT 1" and "IDENTIFIER e20". If I
> understand the flex concept correctly, then the longest pattern is used for
> every
>

The simplest thing to do might be to rearrange the order in which you
define the regexps. I had the impression (can't remember if it was
from reading the source files) that matching is done in sequence, from
the top down. (So what I mean is: define the constant-with-exponent rule
first, then plain constants, then identifiers, and check the remaining
regexps.)
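
An untested sketch of that reordering, using parse-integer instead of the
int/num helpers so it stands alone; the more specific exponent pattern
comes before the plain integer pattern, which comes before identifiers:

(deflexer number-lexer
  ("[:digit:]+[Ee][+-]?[:digit:]+(f|F|l|L)?"
   (return (values 'CONSTANT %0)))    ; "1e20" should match here first
  ("[:digit:]+(u|U|l|L)?"
   (return (values 'CONSTANT (parse-integer %0 :junk-allowed t))))
  ("([:alpha:]|_)([:alnum:]|_)*"
   (return (values 'IDENTIFIER %0)))
  ("[:space:]"))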

Have a nice one.
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <280220062055329673%michaelparker@earthlink.net>
In article <························@v46g2000cwv.googlegroups.com>,
senator <···············@gmail.com> wrote:

> 
> The simplest thing to do might be to re-arrange the order you
> define the regexps. I had the impression (can't remember if it was
> from reading the source files) that the match is done in sequence,
> from top down. (So what I mean is, define constant with e first, then
> constant, then identifiers, and check the remaining regexps).

It's not really done in order; it just looks like it.  The compiler
generates tables to match rules in parallel, but unlike flex it doesn't
do *everything* with tables.
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <280220062053332517%michaelparker@earthlink.net>
In article <·······························@40tude.net>, Frank Buss
<··@frank-buss.de> wrote:

> I've written an eMail to the author of the lexer package, maybe he can fix
> it.

I am the author of the lexer package.  The behavior you have seen is an
artifact of the way the cl-lex program is bolted on top of the regex
engine.  The lexer-generator basically concatenates the various
patterns with an '|', and puts in a hook at the end of each clause that
executes the code for the associated rule.  This short-circuits the
normal "find the longest match" logic.

The fix is to make the generated lexer record the length and registers,
then force a backtrack.  The final rule then checks the various matches
to find the first rule that matched the longest string, and then
advances the lexer and runs the code for the rule that "won".  This
approach is substantially slower (because of the additional
backtracking) but more flex-like.
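
The selection step itself is simple; as an illustration of the idea only
(not the package's actual code), assume the candidates are collected as
(rule-index . match-length) pairs in rule order:

(defun pick-winning-rule (candidates)
  "Return the pair with the longest match.  Ties go to the earlier rule,
because only a strictly longer match replaces the current best and
CANDIDATES is in rule order."
  (let (best)
    (dolist (c candidates best)
      (when (or (null best) (> (cdr c) (cdr best)))
        (setf best c)))))

;; (pick-winning-rule '((0 . 3) (2 . 12) (5 . 12)))  =>  (2 . 12)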

I'll take a look at doing a version of deflexer that uses this
approach.  In the meantime, see if you can work around the issue by
rearranging your rules and using the keyword lookup table technique.
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <280220062341100530%michaelparker@earthlink.net>
In article <································@earthlink.net>, Michael
Parker <·············@earthlink.net> wrote:

> In article <·······························@40tude.net>, Frank Buss
> <··@frank-buss.de> wrote:
> 
> > I've written an eMail to the author of the lexer package, maybe he can fix
> > it.
> 
> I am the author of the lexer package.  The behavior you have seen is an
> artifact of the way the cl-lex program is bolted on top of the regex
> engine.  The lexer-generator basically concatenates the various
> patterns with an '|', and puts in a hook at the end of each clause that
> executes the code for the associated rule.  This short-circuits the
> normal "find the longest match" logic.
> 
> The fix is to make the generated lexer record the length and registers,
> then force a backtrack.  The final rule then checks the various matches
> to find the first rule that matched the longest string, then and
> advances the lexer and runs the code for the rule that "won".  This
> approach is substantially slower (because of the additional
> backtracking) but more flex-like.
> 
> I'll take a look at doing a version of deflexer that uses this
> approach.  In the meantime, see if you can work around the issue by
> rearranging your rules and using the keyword lookup table technique.

There's another version up.

If the symbol :flex-compatible is the first thing that follows the name,
e.g. (deflexer foo :flex-compatible ...), it generates a lexer that
forces greediness between rules.
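
Applied to the small test lexer from earlier in the thread, that would be
(untested sketch against the new version):

(deflexer c-lexer :flex-compatible
  ("int" (return (values 'INT 'INT)))
  ("([:alpha:]|_)([:alnum:]|_)*" (return (values 'IDENTIFIER %0)))
  ("[:space:]"))

;; With greediness forced between rules, "int intelligence " should now
;; lex as (INT . INT) followed by (IDENTIFIER . "intelligence").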

The new style of lexer is slower than the old semi-greedy style, so if
you can do what you need by using symbol tables and reordering your
rules that's the best way to do it.
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1gryjpb2soe2t.1hlcc1cl9z7jx.dlg@40tude.net>
Michael Parker wrote:

> There's another version up.
> 
> If the symbol :flex-compatible is the first thing that follows the name
> e.g. (deflexer foo :flex-compatible ...) it generates a lexer that
> forces greediness between rules.

Thanks, it works great!

> The new style of lexer is slower than the old semi-greedy style, so if
> you can do what you need by using symbol tables and reordering your
> rules that's the best way to do it.

I'm not a compiler expert, but it should be possible to write it so that
the matching time is independent of the number of rules. My naive idea
would be something like this:

These rules:

("ab" (return 'ab))
("abc" (return 'abc))
("abx" (return 'abx))
(" ")

could be translated to this alist tree:

(setf tree '((#\a ((#\b (ab (#\c (abc))
                            (#\x (abx))))))
             (#\Space (:ignore))))

and a match function could match the greediest path:

(match "abc   ab abx ab" tree) -> (ABC AB ABX AB)

Below is an ugly hack of the match function. I think this can be extended
to * and the other regex operators, and it could be even faster with hash
tables or vectors instead of association lists.

(defun match (string rules)
  (when (< 0 (length string))
    (loop for i from 0 to (length string)
          with string-char
          with token = nil
          with tokens = '()
          with branch = rules
          finally (return (nreverse tokens)) do
          (setf string-char (if (= i (length string)) nil (elt string i)))
          (if (atom (first branch))
              (setf token (first branch)
                    branch (cdr branch))
            (setf token nil))
          (let ((rule (assoc string-char branch)))
            (if rule
                (setf branch (cadr rule))
              (progn
                (if token
                    (unless (eql token :ignore) (push token tokens))
                  (format t "no token recognized, position: ~a~%" i))
                (setf branch (cadr (assoc string-char rules)))))))))

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <010320060717058766%michaelparker@earthlink.net>
In article <·······························@40tude.net>, Frank Buss
<··@frank-buss.de> wrote:

> Thanks, it works great!

Glad to hear it.

> 
> > The new style of lexer is slower than the old semi-greedy style, so if
> > you can do what you need by using symbol tables and reordering your
> > rules that's the best way to do it.
> 
> I'm not a compiler expert, but it should be possible to write it in
> constant time, independently of the number of rules. My naive idea would be
> something like this:


It is indeed possible (see chapter 2 of just about any compiler-design
book), and this is more-or-less what flex does.  This approach is
theoretically optimal, but in practice is much slower than you'd think. 
For a real grammar, the tables get so big that they have to be
compressed and then decompressed at run time.  Even if the lexer
forgoes that and uses uncompressed tables, the access patterns into the
tables don't match well with modern cpus, and what you're still left
with is essentially an interpreter, whereas the regex matcher generated
by deflexer is completely compiled machine code.  If you don't use
:flex-compatible, the backtracking tends to be very small for a large
class of languages and the resulting analyser is very quick indeed.

I appreciate your example matcher, and were I more interested in
turning deflexer into a world-class lexer generator that's likely the
approach I'd use, despite the drawbacks, simply for compatibility with
flex.  However, I'm not really in the lexer-generator business.  It was
just something that I needed and was easy to glom onto the side of my
regex engine, and if I worked around the gotchas turned out to be
faster than flex for what I needed.

I did the table-driven thing in my compiler design classes in college
and can't really gen up the interest to write a full-scale version for
deflexer, but if you really want one, please pursue it.  If there's one
thing the C community has shown the Lisp community, it's that we need
more lexer generators.
From: Frank Buss
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <1brqi8apluvm2.gopccsyom6jp.dlg@40tude.net>
Michael Parker wrote:

> I did the table-driven thing in my compiler design classes in college
> and can't really gen up the interest to write a full-scale version for
> deflexer, but if you really want one please pursue it.  If there's one
> thing the the C community has shown the Lisp community it's that we
> needs more lexer generators.

Currently I need it only for CFFI generation, which means that performance
doesn't matter. And the best optimization tip is not to optimize; if it
gets too slow, I'll take a look in the dragon books for how to implement
it. They are on my bookshelf, just in case I need them, but I'm more of an
application hacker than a compiler engineer :-)

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <280220062058270185%michaelparker@earthlink.net>
In article <··············@qrnik.zagroda>, Marcin 'Qrczak' Kowalczyk
<······@knm.org.pl> wrote:

> It makes lexer tables smaller. In a typical implementation every letter
> in every keyword, except common prefixes, induces a 256-element table
> (I have no idea whether this is true in the lexer generator you are
> using though).

No, this lexer generator creates backtracking lexers.  The problem is
that once it finds a matching rule, it immediately runs the code for that
rule rather than checking to see whether there might be a better one.
There is a fix for this; I just have to code it.
From: Michael Parker
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <280220062033390914%michaelparker@earthlink.net>
In article <·······························@40tude.net>, Frank Buss
<··@frank-buss.de> wrote:

> Marcin 'Qrczak' Kowalczyk wrote:
> 
> > It's generally better to recognize identifiers+keywords with a lexing
> > pattern, and then distinguish them by looking them up in a dictionary.
> 
> Do you have a reason why I should look them up in a dictionary? I'll need
> this for the user defined TYPE_NAME token, but I don't see a reason why I
> should use this for the normal keywords, if the lexer behaves like flex and
> is greedy for all patterns in parallel.

The lexer does not behave at all like flex.  They use fundamentally
different algorithms.

It acts as though it matches each pattern in series (the implementation
is partly parallel, partly serial).
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: shift/reduce conflicts with cl-yacc
Date: 
Message-ID: <877j7jh1gj.fsf@qrnik.zagroda>
Frank Buss <··@frank-buss.de> writes:

> which has a hashmap with all types (and macros for the preprocessor etc.).

If you are parsing the source before it has been processed by the
preprocessor, then tricky macros will fool the parser. It's like
reading Lisp code while ignoring reader macros.

Assuming that this worked in practice, this is probably acceptable if
the goal is to extract FFI bindings or generate documentation, because
the preprocessor might lose information useful for programmers, and
tricky macros are rare. Although perhaps in this case it would be
better to ignore function bodies in whole and only match brackets to
find the end, so the impact of tricky macros is lower, limited to
toplevel declarations. The approach of skipping preprocessing is not
sufficient for parsing arbitrary valid C code.

Another problem with parsing real-life C code, or even C headers, is
compiler-specific constructs.

This grammar looks like C89. There is C99, a newer ISO/ANSI C standard.

If you are parsing before preprocessing, there might be #ifdef'ed out
sections which can't be parsed, e.g. for C++.

> Do you know if the C grammar is conflict free at all, or is there
> another bug in my implementation of the grammar?

I think it depends on how it is expressed. I don't know whether
conflict-free presentations are the norm or the exception though.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/