I am currently writing parsers in Common Lisp to parse a number of
Internet headers. I would like looked and tried out a number of
approaches but would like some opinions from others who may have more
experience on some of the pitfalls.
Firstly regular expression pattern matchers do not appear suitable in
general as some headers allow recursive phrases (e.g. rfc2822
comments) which they cannot match directly.
I had an attempt at using the META pattern matcher but found the
syntax rather awkward to use, and in any case it still doesn't solve,
by itself, the recursive problem.
I therefore end up having to write my own parsers. It seems to me I
can do this either by parsing directly from a stream or reading in a
string and then parsing along that.
I constructed a macro DO-PARSE-STREAM which enables me to, for
example, write a function like
(defun scan-comment(ip op)
"Scan a comment phrase until closing #\)
on ip stream collecting it on op stream"
(do-parse-stream(c ,ip)
(#\\ (write-char (read-char ,ip) ,op)) ; quoted pair
(#\( (write-char c op) ; enclosed comment
(scan-comment ip op)
(write-char #\) op))
(#\) (return)) ; terminator
(TEXT (write-char c op))) ; valid text type character
(t (signal 'parse-error)
(write-char c op))) ; any other character
This is similar I guess to META but I find it easier to understand.
Matches can be against characters, symbols representing types or a
form. The macro automatically loops across the stream and use CASE,
TYPECASE or COND depending on what clauses are given.
The stream approach does mean that when parsing from a string I have
to use WITH-INPUT-FROM-STRING. In the implementation I am using,
CMUCL, this appears to be efficient however is it reasonable to assume
that it will be on other implementations?
I constructed a similar macro DO-PARSE-STRING which allows multiple
character matches, however to use that in parser I have to keep
passing around the current position in the string and the end position
which complicates things. If I wrap this into a class I am almost
reinventing string-streams which seems pointless apart from enabling
multiple-character look-ahead.
I have been able to write a general parser for Email address headers
and URL's with this approach, however are there any serious downfalls
of the limitation of having only one character look-ahead (apart from
not being able to decide what you parsing until you have read and
stored some characters and come across a terminator character)?
Are there any other approaches I should be looking at?
--
John A.R. Williams
··············@aston.ac.uk (Dr. John A.R. Williams) writes:
> I am currently writing parsers in Common Lisp to parse a number of
> Internet headers. I would like looked and tried out a number of
> approaches but would like some opinions from others who may have more
> experience on some of the pitfalls.
...
> I had an attempt at using the META pattern matcher but found the
> syntax rather awkward to use, and in any case it still doesn't solve,
> by itself, the recursive problem.
...
> Are there any other approaches I should be looking at?
Here's a posting about a META-based scheme (minus the reader-macro
syntax) that I wrote some time ago. I haven't looked at it in some
time and I've learned a lot since I wrote it so I may be appalled when
I look at it more closely. But some code with some example grammars
are linked to from the post.
<http://groups.google.com/groups?selm=m38yvm8bvm.fsf%40localhost.localdomain>
I'm hoping to polish this up for use in my book, so it may get better
in the future.
-Peter
--
Peter Seibel ·····@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
"Dr. John A.R. Williams" <··············@aston.ac.uk> wrote in message
···················@heisenberg.aston.ac.uk...
> This is similar I guess to META but I find it easier to understand.
> Matches can be against characters, symbols representing types or a
> form. The macro automatically loops across the stream and use CASE,
> TYPECASE or COND depending on what clauses are given.
Whatever floats your boat. If the META tool saves you time, then use it. If
it doesn't, don't worry about it.
> The stream approach does mean that when parsing from a string I have
> to use WITH-INPUT-FROM-STRING. In the implementation I am using,
> CMUCL, this appears to be efficient however is it reasonable to assume
> that it will be on other implementations?
I wouldn't worry about it either way. The STREAM abstraction is more general
and lets you use pretty much anything as a source. I wouldn't worry about
the performance until you need to parse your 10,000 email archive and
discover that the STREAM is slowing you down vs something else (like basic
disk I/O).
> I constructed a similar macro DO-PARSE-STRING which allows multiple
> character matches, however to use that in parser I have to keep
> passing around the current position in the string and the end position
> which complicates things. If I wrap this into a class I am almost
> reinventing string-streams which seems pointless apart from enabling
> multiple-character look-ahead.
>
> I have been able to write a general parser for Email address headers
> and URL's with this approach, however are there any serious downfalls
> of the limitation of having only one character look-ahead (apart from
> not being able to decide what you parsing until you have read and
> stored some characters and come across a terminator character)?
A heck of a lot of things can be handled with a single character of
lookahead. Those that can't may simply be specific exceptions to a more
general grammar that can be handled on a case by case basis (such as storing
extra state as you move forward through an "ambiguous" area). If 95% of your
parser works with a one char lookahead, no reason to redo the entire thing
for some select pieces.
Basically, rather than trying to write something generic to fits the unknown
needs of the world, write something that fits YOUR needs and advances your
project forward. There's also no reason why you can't have wrappers around a
simple stream that handles all of the lookahead you want, if it comes to
that, and you can do that at the last minute.
Regards,
Will Hartung
(·····@msoft.com)
··············@aston.ac.uk (Dr. John A.R. Williams) writes:
> I had an attempt at using the META pattern matcher but found the
> syntax rather awkward to use, and in any case it still doesn't solve,
> by itself, the recursive problem.
I was very impressed by META. The code was so simple and so elegant.
However, the code for something as complex as a high level (infix)
programming language gets a little complex and difficult to maintain.
For this purpose I prefer the ZEBU parser. Have you looked at ZEBU?
However, it appears that you are having issues with the lexical
analyzer and not the parser itself?
Petter