From: Cesar Crusius
Subject: Tokenizing streams in Lisp. How to?
Message-ID: <9d4i2e$shi$1@nntp.Stanford.EDU>
Hi! Somewhat beginner in Lisp here, with the following problem: I want
to write a lisp program that would take an input file as the argument,
and tokenize it according to some rules. Is there a 'better' way to do
this, whatever that means? I ask this because there may be some
standard techniques that are particular to Lisp (I'm familiar with
writing parsers in C++), or some package, or whatever. The more
generic topic would be of course text file processing in Lisp. Any
pointers would help! (If that's relevant, I'm using CLISP.)

Thanks!

-- 
Cesar Augusto Rorato Crusius            o      _     _         _
Stanford University            __o     /\_   _ \\o  (_)\__/o  (_)
··············@stanford.edu  _`\<,    _>(_) (_)/<_    \_| \   _|/' \/
www.stanford.edu/~crusius   (_)/(_)  (_)        (_)   (_)    (_)'  _\o_

He who sacrifices functionality for ease of use
Loses both and deserves neither

From: Kent M Pitman
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <sfwg0ehg9gl.fsf@world.std.com>
·······@localhost.debian.org (Cesar Crusius) writes:

> Hi! Somewhat beginner in Lisp here, with the following problem: I want
> to write a lisp program that would take an input file as the argument,
> and tokenize it according to some rules. Is there a 'better' way to do
> this, whatever that means? I ask this because there may be some
> standard techniques that are particular to Lisp (I'm familiar with
> writing parsers in C++), or some package, or whatever. The more
> generic topic would be of course text file processing in Lisp. Any
> pointers would help! (If that's relevant, I'm using CLISP.)

Why don't you give an example of the kind of tokenization you want to do,
and a hint of the kind of processing you plan to do on the resulting tokens.
From: Cesar Crusius
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <9d54t8$5pn$1@nntp.Stanford.EDU>
Kent M Pitman wrote:
> 
> Why don't you give an example of the kind of tokenization you want to do,
> and a hint of the kind of processing you plan to do on the resulting tokens.

If you know CWEB, I was going to see how hard it would be to reimplement
it in Lisp, now with generic tokenizers and printers. If you don't,
here's the thing: I would like to parse source code (C, Lisp,
whatever) into tokens, and then to print the code nicely as TeX or
something like this. Each code token has a particular type (variable,
keyword, and so on), and is printed differently according to type.
There are many other details, such as the source code actually being
embedded inside other code (CWEB code), but that should give an idea of
the task.
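
Roughly the kind of per-type printing I have in mind (a hypothetical
sketch, not CWEB's actual rules):

 (defun token->tex (token type)
   ;; Hypothetical: render TOKEN as TeX markup according to TYPE.
   (ecase type
     (:keyword  (format nil "{\\bf ~a}" token))
     (:variable (format nil "{\\it ~a}" token))
     (:operator (format nil "$~a$" token))))

 (token->tex "while" :keyword)
 => "{\\bf while}"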

(btw, I know of other literate programming systems, just don't like
the way they're written.)


-- 
Cesar Augusto Rorato Crusius            o      _     _         _
Stanford University            __o     /\_   _ \\o  (_)\__/o  (_)
··············@stanford.edu  _`\<,    _>(_) (_)/<_    \_| \   _|/' \/
www.stanford.edu/~crusius   (_)/(_)  (_)        (_)   (_)    (_)'  _\o_

He who sacrifices functionality for ease of use
Loses both and deserves neither
From: Kent M Pitman
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <sfwelu1mr5b.fsf@world.std.com>
·······@localhost.debian.org (Cesar Crusius) writes:

> 
> Kent M Pitman wrote:
> > 
> > Why don't you give an example of the kind of tokenization you want
> > to do, and a hint of the kind of processing you plan to do on the
> > resulting tokens.
> 
> If you know CWEB,

Nope.
> 
> I was going to see how hard it would be to
> reimplement it in Lisp, now with generic tokenizers and printers. If
> you don't, here's the thing: I would like to parse source code (C,
> Lisp, whatever) into tokens, and then to print the code nicely as
> TeX or something like this. Each code token has a particular type
> (variable, keyword, and so on), and is printed differently according
> to type.  There are many other details, like the source code is
> actually inside some other code (CWEB code), but that should give an
> idea of the task.
> 
> (btw, I know of other literate programming systems, just don't like
> the way they're written.)

Common Lisp doesn't come with any specific tokenization routines per se,
which is why I asked.  I think you may be able to find a regular expression
package in either the CMU AI archives or some other such repository; the 
place to start your search for such libraries is usually
 http://www.alu.org/

The reason I asked for an example is that by tokenization some people mean
just simple parse-forward, break-on-delimiter tokenizing, which isn't
that hard to write from scratch.

 (defvar *whitespace* '(#\Space #\Tab #\Return #\Linefeed))

 (defun whitespace? (x) (member x *whitespace*))

 (defvar *single-char-ops* '(#\+ #\- #\* #\/ #\( #\) #\. #\, #\=))

 (defun single-char-op? (x) (member x *single-char-ops*))

 (defun tokenize (text)
   (let ((chars '())    ; characters of the token being built
         (result '()))  ; completed tokens, most recent first
     (flet ((next-token () 
              (when chars 
                (push (coerce (nreverse chars) 'string) result)
                (setq chars '()))))
       (dotimes (i (length text))
         (let ((ch (char text i)))
           (cond ((whitespace? ch) 
                  (next-token))
                 ((single-char-op? ch) 
                  (next-token)
                  (push ch chars)
                  (next-token))
                 (t
                  (push ch chars)))))
       (next-token)
       (nreverse result))))

 (tokenize "foo+bar = baz")
 => ("foo" "+" "bar" "=" "baz")

This could actually be optimized for speed in various obscure ways,
but I tried to pick a simple way of presenting it so you could get the
general sense of typical style.

Many Lispers would prefer symbols to strings (e.g., so they can do 
pointer comparisons) and so might modify the tokenize routine's
next-token operator as in:

 (defun tokenize (text)
   (let ((chars '()) (result '()))
     (flet ((next-token () 
              (when chars 
                (push (LOOKUP-TOKEN
                        (coerce (nreverse chars) 'string))
                      result)
                (setq chars '()))))
       (dotimes (i (length text))
         (let ((ch (char text i)))
           (cond ((whitespace? ch) 
                  (next-token))
                 ((single-char-op? ch) 
                  (next-token)
                  (push ch chars)
                  (next-token))
                 (t
                  (push ch chars)))))
       (next-token)
       (nreverse result))))

 (defun lookup-token (string)
   (intern             ; get an interned symbol
     (string-upcase    ; fold case to uppercase
       string)))
   

 (tokenize "foo+BAR = Baz")
 => (FOO + BAR = BAZ)

You could, of course, implement the token lookup in other ways, too.
e.g.,

 (defvar *operator-table* (make-hash-table :test 'equal))
 (setf (gethash "+" *operator-table*) 'plus)
 (setf (gethash "*" *operator-table*) 'times)
 (setf (gethash "/" *operator-table*) 'divided-by)
 (setf (gethash "-" *operator-table*) 'minus)
 (setf (gethash "=" *operator-table*) 'equals)

 (defun lookup-token (string)
   (or (gethash string *operator-table*) ;case-sensitive
       (intern (string-upcase string)))) ;case-folding

 (tokenize "Foo+BAR = Baz")
 => (FOO PLUS BAR EQUALS BAZ)

If you're looking for something that will recognize more complex patterns
than just simple break characters and single character operations, though,
then you want to do some state management or regular expressions or
something else...

- - - - 

Tokenization from a stream is similar, but uses 
 (read-char stream) ;read next char or signal error if none
 (read-char stream nil nil) ;read next char or return NIL if none
 (peek-char nil stream) ;peek at next char, signal error if none
 (peek-char nil stream nil nil) ;peek at next char, return NIL if none
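
e.g., the first tokenize routine above, reworked to pull characters from
a stream instead of indexing a string (a sketch, no better tested than
the rest):

 (defun tokenize-stream (stream)
   (let ((chars '()) (result '()))
     (flet ((next-token ()
              (when chars
                (push (coerce (nreverse chars) 'string) result)
                (setq chars '()))))
       (loop for ch = (read-char stream nil nil) ; NIL at end of stream
             while ch
             do (cond ((whitespace? ch)
                       (next-token))
                      ((single-char-op? ch)
                       (next-token)
                       (push ch chars)
                       (next-token))
                      (t
                       (push ch chars))))
       (next-token)
       (nreverse result))))

 (with-input-from-string (stream "foo+bar = baz")
   (tokenize-stream stream))
 => ("foo" "+" "bar" "=" "baz")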

- - - - 

It's also not that hard to build a data structure for parsing operators
forward.  Just make a tree that can track the state at any point, e.g.:

 state ::= ( default-token . transitionlist )
 transitionlist ::= (( char1 . state1 )
                     ( char2 . state2 )
                     ( char3 . state3 ) ...)

In this way you can start with a transition list (initial chars of operators)
and peek forward, a character at a time, to find the longest matching
operator.

 (defparameter *token-tree*
   ;; normally you'd have a program assemble this rather than
   ;; specifying it as a big, ugly literal.
   '((#\+ . ( + . ((#\+ . ( ++ )))))
     (#\- . ( - . ((#\- . ( -- . ((#\> . ( --> ))))))))
     (#\/ . ( / ))
     (#\* . ( * ))
     (#\( . ( *LPAREN* ))
     (#\) . ( *RPAREN* ))))
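
For instance, a helper to assemble the tree from (string token) pairs
might look like this (a sketch, tested about as lightly as the rest):

 (defun add-operator (token chars tree)
   ;; Add TOKEN, keyed by the character list CHARS, to TREE (a
   ;; transition list as above).  Returns the updated tree.
   (let ((entry (assoc (first chars) tree)))
     (unless entry
       (setq entry (list (first chars) nil)) ; (char default . transitions)
       (push entry tree))
     (if (rest chars)
         (setf (cddr entry) (add-operator token (rest chars) (cddr entry)))
         (setf (cadr entry) token))
     tree))

 (defun build-token-tree (specs)
   ;; SPECS is a list of (string token) pairs.  Note that every proper
   ;; prefix of an operator should itself be an operator, or its
   ;; default token will be NIL.
   (let ((tree '()))
     (dolist (spec specs tree)
       (setq tree (add-operator (second spec)
                                (coerce (first spec) 'list)
                                tree)))))

 (build-token-tree '(("+" +) ("++" ++) ("-" -) ("--" --) ("-->" -->)))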

 (defun read-char-or-token (stream)
   (let* ((ch (read-char stream nil nil))
          (entry (assoc ch *token-tree*)))
     (if (not entry)
         ch
         (read-token (cdr entry) stream))))

 (defun read-token (state stream)
   (let* ((ch (peek-char nil stream nil nil))
          (entry (assoc ch (cdr state))))
     (cond (entry 
            (read-char stream) ;commit the char we peeked at
            (read-token (cdr entry) stream))
           (t ;no new state, so use default from previous state
            (car state)))))

 (with-input-from-string (str "x-foo--")
   (loop for item = (read-char-or-token str)
         while item
         do (print item)))
 #\x
 -
 #\f
 #\o
 #\o
 --


Anyway, it's way past my bedtime and I'm sleepy.
All of the above code is tested only lightly while half-asleep.
Hope it helps, though.
From: Cesar Crusius
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <9d811c$a0j$1@nntp.Stanford.EDU>
Gee. This has to be the most friendly newsgroup on the net (coming
from some nasty places...). Thanks for the answer; it was very useful.
I also got an email answer mentioning META, which was also quite nice.

Thanks!

-- 
Cesar Augusto Rorato Crusius            o      _     _         _
Stanford University            __o     /\_   _ \\o  (_)\__/o  (_)
··············@stanford.edu  _`\<,    _>(_) (_)/<_    \_| \   _|/' \/
www.stanford.edu/~crusius   (_)/(_)  (_)        (_)   (_)    (_)'  _\o_

He who sacrifices functionality for ease of use
Loses both and deserves neither
From: dr.mtarver
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <7bMK6.1029$l6.174096@monolith.news.easynet.net>
Hi,

If you download Axel from www.spherum.com, you'll find a function
read-file-as-charlist that reads in a file as a list of characters.
There is also a function read-file-as-stringlist that takes a file name and
a function F of type character -> boolean and uses F as a test of
termination of tokens, assembling the stream of characters into a list
of strings.
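
Roughly, they behave like these plain-CL sketches (illustrative only,
not the actual Axel source):

(defun read-file-as-charlist (path)
  ;; Return the file's contents as a list of characters.
  (with-open-file (stream path :direction :input)
    (loop for ch = (read-char stream nil nil)
          while ch
          collect ch)))

(defun read-file-as-stringlist (path terminates-token-p)
  ;; TERMINATES-TOKEN-P is a character -> boolean predicate; maximal
  ;; runs of non-terminating characters become the result strings.
  (with-open-file (stream path :direction :input)
    (let ((tokens '()) (chars '()))
      (flet ((flush ()
               (when chars
                 (push (coerce (nreverse chars) 'string) tokens)
                 (setq chars '()))))
        (loop for ch = (read-char stream nil nil)
              while ch
              do (if (funcall terminates-token-p ch)
                     (flush)
                     (push ch chars)))
        (flush)
        (nreverse tokens)))))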

Axel is written in Common Lisp and is specifically designed to run under
CLisp.  If you want to do the job in CLisp, then you could use the source
code in the Axel library.  If you want to learn Axel, then there's a load of
material to help you.

regards

Mark Tarver

Cesar Crusius <·······@localhost.debian.org> wrote in message
·················@nntp.Stanford.EDU...
> Hi! Somewhat beginner in Lisp here, with the following problem: I want
> to write a lisp program that would take an input file as the argument,
> and tokenize it according to some rules. Is there a 'better' way to do
> this, whatever that means? I ask this because there may be some
> standard techniques that are particular to Lisp (I'm familiar with
> writing parsers in C++), or some package, or whatever. The more
> generic topic would be of course text file processing in Lisp. Any
> pointers would help! (If that's relevant, I'm using CLISP.)
>
> Thanks!
>
From: Steve Long
Subject: Re: Tokenizing streams in Lisp. How to?
Message-ID: <3AFC8584.40999B93@isomedia.com>
This works in PowerLisp and ACL, so it should work in almost any CL
environment.  Optimization is left to the user.

(defparameter *eof* 'EOF) ; simple formality

(defun read-delimited-file
   (path-string
    token
    &key
    (if-does-not-exist :error)
    (ignore-comments nil))
  ;; Read the file at PATH-STRING line by line, splitting each line on
  ;; the character TOKEN and returning a list of lists of strings.
  ;; Unless IGNORE-COMMENTS is true, ";" and "#| ... |#" comments are
  ;; stripped before splitting.
  (let ((list-of-lists '()))
    (flet ((read-and-parse-line (stream)
             (let ((line (read-line-sp stream ignore-comments)))
               (if (and (not (null line)) (not (eql line *eof*)))
                   (string-parse line token)
                 line))))
      (with-open-file
          (stream
           path-string
           :direction :input
           :if-does-not-exist if-does-not-exist)
        (do ((line-as-list
              (read-and-parse-line stream)
              (read-and-parse-line stream)))
            ((eql line-as-list *eof*))
          (push line-as-list list-of-lists)))
      (nreverse list-of-lists))))

(defun read-line-sp
    (stream &optional (ignore-comments nil))
  ;; Read one line from STREAM.  Unless IGNORE-COMMENTS is true, ";" and
  ;; "#| ... |#" comments are stripped.  Returns the remaining text, NIL
  ;; for a comment-only or blank line, or *EOF* at end of file.
  (let ((str (read-line stream nil *eof*)))
    (if (equalp str *eof*)
        *eof*
      (if (stringp str)
           (let* ((semi
                  (when (not ignore-comments)
                    (search ";" str)))
                 (start-com
                  (when (not ignore-comments)
                    (search "#|" str)))
                 (end-com
                  (when (not ignore-comments)
                    (search "|#" str)))
                 (sub-str
                  (cond
                   (semi (subseq str 0 semi))
                   ((and start-com end-com)
                    (concatenate 'string
                      (subseq str 0 start-com)
                      (subseq str (min (+ 2 end-com) (length str)))))
                   (start-com
                    (subseq str 0 start-com))
                   (end-com
                    (subseq str (min (+ 2 end-com) (length str))))
                   (t str))))
            (cond
             ((or
               (and semi (zerop semi))
               (and start-com (zerop (length sub-str)))
               (and end-com (zerop (length sub-str))))
              nil)
             ((zerop (length sub-str))
              nil) ; blank line, not end of file
             (t
              sub-str)))
        *eof*))))

(defun string-parse
    (string &optional (token #\space))
  ;; Split STRING on the character TOKEN; empty substrings are kept.
  (labels ((cons-list (p str list-out)
             (if p (cons (subseq str 0 p) list-out) list-out)))
    (do* ((str      string                (subseq str (1+ p)))
          (p        (position token str)  (position token str))
          (list-out (cons-list p str nil) (cons-list p str list-out)))
        ((null p) (nreverse (cons str list-out))))))
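
E.g.,

(string-parse "alpha beta gamma")
=> ("alpha" "beta" "gamma")

(string-parse "a,,b" #\,)
=> ("a" "" "b") ; empty fields are preserved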




Cesar Crusius wrote:

> Hi! Somewhat beginner in Lisp here, with the following problem: I want
> to write a lisp program that would take an input file as the argument,
> and tokenize it according to some rules. Is there a 'better' way to do
> this, whatever that means? I ask this because there may be some
> standard techniques that are particular to Lisp (I'm familiar with
> writing parsers in C++), or some package, or whatever. The more
> generic topic would be of course text file processing in Lisp. Any
> pointers would help! (If that's relevant, I'm using CLISP.)
>
> Thanks!
>