From: Maverik
Subject: State variables in cl-lexer
Date: 
Message-ID: <1162659131.319413.202250@m73g2000cwd.googlegroups.com>
Hi!

I originally sent this message to Michael Parker
(··········@hotmail.com), but his mailbox seems to be unavailable, so
I am posting it here.

Michael Parker wrote on the site
(http://www.geocities.com/mparker762/clawk.html?200629#lexer):

> Currently, the LEX/FLEX/BISON feature of switching productions on and off using state variables is not supported, but it's a pretty simple feature to add. If you're using LEXER and discover you need this feature, let me know.

I've been trying to use the lexer in a project that requires parsing
PL/SQL. In such an application the lack of state variables is very
noticeable. The two most important cases are the lexical recognition
of strings and of C-style block comments.

The classic sample of C-style block comment recognition is something
like

"/*"                 { set_start_condition(comment); }
<comment>[^*]+       { ... }
<comment>"*"+[^*/]+  { ... }
<comment>"*"+"/"     { set_start_condition(INITIAL); }

Here "..." stands for the processing of the comment content, which is
usually skipped but is valuable in my case. The state variable is very
important here, because the two-character terminating token "*/" must
be recognized. Without state variables I see only two solutions:
first, to recognize the comment word by word and process it at a
higher level; second, to try to dispatch the same source between
several lexers.

The situation is almost the same with strings, where an escaped
terminator (like "\\"" or "''") has to be recognized.

How much time would it take to implement state variables in CL-Lexer?
Could someone point me to where I should make changes to obtain such
functionality? Or is there perhaps a workaround or another simple
solution for such cases?

Finally, and by the way, is it possible (and how easy) to implement a
case-insensitive mode in CL-Lexer? PL/SQL is a case-insensitive
language (as are many others), and I solve the problem by generating
special regexes for keywords (e.g., I generate the regex
"[Tt][Aa][Bb][Ll][Ee]" for the keyword "table"), but this feels like a
workaround. I hope a case-insensitive mode would also allow the
lexical analysis to be optimized in some way.
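
For readers who want to reproduce the keyword workaround described
above, here is a minimal sketch in plain Common Lisp. The function name
CASE-INSENSITIVE-REGEX is hypothetical and is not part of CL-Lexer; it
only builds the pattern string and assumes the keyword contains no
regex metacharacters.

```lisp
;; Hypothetical helper, not part of CL-Lexer: builds a regex string that
;; matches KEYWORD regardless of letter case.
(defun case-insensitive-regex (keyword)
  "Return a regex string such as \"[Tt][Aa]\" for the keyword \"ta\"."
  (with-output-to-string (out)
    (loop for ch across keyword
          do (if (both-case-p ch)
                 ;; Letters become a two-element character class.
                 (format out "[~C~C]" (char-upcase ch) (char-downcase ch))
                 ;; Anything else is copied through unescaped (assumption:
                 ;; keywords contain no regex metacharacters).
                 (write-char ch out)))))
```

Calling (case-insensitive-regex "table") yields the string
"[Tt][Aa][Bb][Ll][Ee]" quoted above.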

From: Michael Parker
Subject: Re: State variables in cl-lexer
Date: 
Message-ID: <041120062037352157%michaelparker@earthlink.net>
In article <························@m73g2000cwd.googlegroups.com>,
Maverik <··············@gmail.com> wrote:

> Currently, the LEX/FLEX/BISON feature of switching productions on 
> and off using state variables is not supported, but it's a pretty 
> simple feature to add. If you're using LEXER and discover you need 
> this feature, let me know.

I'm not absolutely certain what solution I was thinking of when I wrote 
that, but I suspect that I was thinking of partitioning the patterns 
into the various conditions, and switching between them in the lexer 
function.  You could also pull this off by inserting a function node 
into the parse tree in front of each token's expression that tests the 
condition variable and fails or succeeds depending on whether it 
matches the condition for that token.
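
To make the first idea (partitioning the patterns by condition and
switching between them in the lexer function) concrete, here is a
rough, self-contained sketch. It does not use CL-Lexer or any regex
library: the matchers are hand-written closures, and every name in it
(MATCH-WHILE, MATCH-LITERAL, MATCH-ANY-CHAR, *RULES*, MAKE-LEXER,
TOKENIZE) is hypothetical. It only illustrates the control flow, with
the condition kept in a closure rather than in a global variable.

```lisp
;; All names below are illustrative; none of them belong to CL-Lexer.
(defun match-while (predicate)
  "Matcher consuming one or more characters that satisfy PREDICATE."
  (lambda (string start)
    (let ((end (or (position-if-not predicate string :start start)
                   (length string))))
      (when (> end start) end))))

(defun match-literal (literal)
  "Matcher that succeeds only when LITERAL occurs at START."
  (lambda (string start)
    (let ((end (+ start (length literal))))
      (when (and (<= end (length string))
                 (string= literal string :start2 start :end2 end))
        end))))

(defun match-any-char ()
  "Matcher consuming exactly one character."
  (lambda (string start)
    (when (< start (length string)) (1+ start))))

;; Each rule is (condition matcher token-type next-condition).
;; A NIL token-type means "skip"; a NIL next-condition means "stay".
;; Within a condition the first matching rule wins, so "/*" precedes "/".
(defparameter *rules*
  (list (list :initial (match-literal "/*") nil :comment)
        (list :initial (match-while #'alphanumericp) :id nil)
        (list :initial (match-literal "*") :mul nil)
        (list :initial (match-literal "/") :div nil)
        (list :initial (match-while (lambda (c) (char= c #\Space))) nil nil)
        (list :comment (match-literal "*/") nil :initial)
        (list :comment (match-any-char) nil nil)))

(defun make-lexer (input)
  "Return a closure yielding (values type text), or NIL at end of input."
  (let ((pos 0)
        (condition :initial))      ; the state variable lives in the closure
    (lambda ()
      (block next-token
        (loop while (< pos (length input))
              do (let ((rule (find-if
                              (lambda (rule)
                                ;; Only rules of the active condition apply.
                                (and (eq (first rule) condition)
                                     (funcall (second rule) input pos)))
                              *rules*)))
                   (unless rule
                     (error "No rule matches at position ~D" pos))
                   (destructuring-bind (matcher type next) (rest rule)
                     (let ((start pos))
                       (setf pos (funcall matcher input start))
                       (when next (setf condition next))
                       (when type
                         (return-from next-token
                           (values type (subseq input start pos))))))))
        nil))))

(defun tokenize (input)
  "Collect all tokens of INPUT as a list of (type text) lists."
  (let ((lexer (make-lexer input)))
    (loop for token = (multiple-value-list (funcall lexer))
          while (first token)
          collect token)))
```

With these rules, (tokenize "a / b /* *** */ c * d") returns
((:ID "a") (:DIV "/") (:ID "b") (:ID "c") (:MUL "*") (:ID "d")).
The comment rules are simply invisible while the condition is
:initial, and the operator rules are invisible inside a comment.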


> Finally, and by the way, is it possible (and how easy) to implement a
> case-insensitive mode in CL-Lexer? PL/SQL is a case-insensitive
> language (as are many others), and I solve the problem by generating
> special regexes for keywords (e.g., I generate the regex
> "[Tt][Aa][Bb][Ll][Ee]" for the keyword "table"), but this feels like a
> workaround. I hope a case-insensitive mode would also allow the
> lexical analysis to be optimized in some way.

The usual way to do this in systems like lex and yacc is to keep a hash 
table of all the keywords and their tokens, and have a pattern for 
symbols in the lexical grammar that matches symbols of any case (i.e. 
something like [A-Za-z][A-Za-z0-9]*).  The action for this uppercases 
the string and looks it up in the hash table.  If it's found, it 
returns the token from the hash table; if it isn't, it returns the 
symbol token, with the string as the value of the token.
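
A minimal sketch of that lookup in plain Common Lisp follows. The names
*KEYWORDS* and CLASSIFY-WORD and the token keywords are hypothetical
and chosen only for illustration; the keyword list is a made-up sample,
not a PL/SQL keyword set.

```lisp
;; Hypothetical names; not part of CL-Lexer.
(defparameter *keywords*
  (let ((table (make-hash-table :test #'equal)))
    ;; Keys are stored upper-cased so lookups can normalize the same way.
    (dolist (entry '(("TABLE" . :table) ("SELECT" . :select) ("FROM" . :from)))
      (setf (gethash (car entry) table) (cdr entry)))
    table)
  "Maps upper-cased keyword spellings to their token types.")

(defun classify-word (text)
  "Return (values token-type text) for a word matched by the symbol pattern."
  (let ((token (gethash (string-upcase text) *keywords*)))
    (if token
        (values token text)     ; a keyword, in whatever case it was written
        (values :id text))))    ; an ordinary symbol
```

Assuming rule actions work as in the other examples in this thread, a
single rule along the lines of
("[A-Za-z][A-Za-z0-9]*" (return (classify-word lexer:%0)))
would then replace the per-keyword regexes.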
From: Maverik
Subject: Re: State variables in cl-lexer
Date: 
Message-ID: <1162724483.124256.46200@m73g2000cwd.googlegroups.com>
On Nov 5, 4:14 am, Michael Parker <·············@earthlink.net> wrote:

> You could also pull this off by inserting a function node
> into the parse tree in front of each token's expression that tests the
> condition variable and fails or succeeds depending on whether it
> matches the condition for that token.

Thanks for the advice, but could you be more specific? Consider the
following example:

(lexer:deflexer partlexer
  ;; General lexical processing
  ("[A-Za-z0-9]+" (return (values :id lexer:%0)))
  ("\\*" (return (values :mul lexer:%0)))
  ("/" (return (values :div lexer:%0)))
  ("[:space:]+")

  ;; Block comment processing
  ("/\\*" (setq comment t))      ; Start comment processing
  ("[^\\*]*")                    ; Skip anything not asterisk
  ("[\\*]+[^/]")                 ; Skip asterisks without tail slash
  ("\\*+/" (setq comment nil))   ; Finish comment processing
)

(defparameter *lex* (partlexer "a / b /* *** */ c * d"))
(loop for token = (multiple-value-list (funcall *lex*))
      while (car token)
      do (print token))

The expected output is:

(:ID "a") (:DIV "/") (:ID "b") (:ID "c") (:MUL "*") (:ID "d")

but the actual output is:

(:ID "d")

for the flex-compatible mode and

(:ID "a") (:DIV "/") (:ID "b") (:DIV "/") (:MUL "*") (:MUL "*")
(:MUL "*") (:MUL "*") (:MUL "*") (:DIV "/") (:ID "c") (:MUL "*")
(:ID "d")

for the pure (non-flex-compatible) one.

Naturally, I expected the comment rules to skip the comment, but in
flex-compatible mode they consume everything up to the last asterisk,
and in pure mode they are overridden by the rules for the arithmetic
operators. How should I change the lexer definition to fix this?

Next, I do not understand how to introduce a state variable into the
lexer definition. In the example I'm trying to use the variable
"comment", which should be turned on when the beginning of a comment
is recognized and turned off at its end. However, I do not know how to
use the value of this variable to enable or disable the rules for
comment processing. Finally, the system warns me that the variable
'comment' is undefined, and I do not know how to introduce it locally,
because deflexer doesn't allow me to use a LET form. I have therefore
defined the variable globally via DEFVAR, but this does not feel right
to me.

BTW, I get a system error when I try to compile a lexer definition in
flex-compatible mode into a fasl:

Execution of a form compiled with errors.
Form:
  (ALT
 (SEQ
  (REG 0 (+ (CHARCLASS
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)))
  (HOOK #))
 (SEQ (REG 0 *) (HOOK #)) (SEQ (REG 0 /) (HOOK #))
 (SEQ (REG 0 (+ (SPECCLASS SPACE))) (HOOK #)) (SEQ (REG 0 (SEQ / *))
(HOOK #))
 (SEQ (REG 0 (* (NCHARCLASS *))) (HOOK #))
 (SEQ (REG 0 (SEQ (+ (CHARCLASS *)) (NCHARCLASS /))) (HOOK #))
 (SEQ (REG 0 (SEQ (+ *) /)) (HOOK #)) (SUCCEED T))
Compile-time error:
  Objects of type FUNCTION can't be dumped into fasl files.
   [Condition of type SB-INT:COMPILED-PROGRAM-ERROR]

The error doesn't appear when I use the pure-mode definition. Is this
a bug or a feature? I use SBCL.

Thanks!