CLAWK Confusion

From: Jack
Subject: CLAWK Confusion
Date: Fri, 10 Mar 2006 16:33:26 +0000
Message-ID: <1142008406.820567.85050@u72g2000cwu.googlegroups.com>

I'd be grateful if anyone can help me with this problem.

I have a working AWK script that I want to convert to CLAWK. The AWK
script is as follows:

BEGIN { FS = "\t" }
/Tax Log Report/ { next }
/Total/ { next }
/Due to rounding/ { next }
/Friday, December 10, 2004/ { next }
$1 ~ /State/ { next }
$2 ~ /JurisID/ { next }
{ s1 = "12-10-2004"
  if (trim($2) == "0") {
    dump( s1, trim($1), "", "", trim($2), trim($3), trim($4), trim($5), \
      demonetize($6), demonetize($7), demonetize($8), demonetize($9), \
      trim($10), demonetize($11) )
  } else {
    dump( s1, trim($1), trim($2), trim($3), trim($4), trim($5), trim($6), trim($7), \
      demonetize($8), demonetize($9), demonetize($10), demonetize($11), \
      trim($12), demonetize($13) )
  }
}

function dump(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14) {
    printf "\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n", \
      s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14
}

function trim(str) {
    gsub(/^[ \t]+/, "", str)
    gsub(/[ \t]+$/, "", str)
    return str
}

function demonetize(str) {
    str = trim(str)
    sub(/\$/, "", str)
    gsub(/\,/, "", str)
    return str
}


Using sbcl-0.9.9, Emacs and slime, I came up with the following first
try:



(in-package :clawk-user)

(install-regex-syntax)

...

(let ((reported  ()))
  (defun transform-log (report-date file)
     (setq reported report-date)
     (log2csv file))

  (defawk log2csv ()
     (BEGIN
       (setq *FS* "\\t"))
     (#/Tax Log Report/ (next))
     (#/Total/ (next))
     (#/Due to rounding/ (next))
     (reported (next))
     ((~ $1 "State") (next))
     ((~ $2 "JurisID") (next))
     (t (if (~ (trim $2) "0")
          (dump reported (trim $1) "" "" (trim $2) (trim $3) (trim $4) (trim $5)
                (demonetize $6) (demonetize $7) (demonetize $8) (demonetize $9)
                (trim $10) (trim $11))
          (dump reported (trim $1) (trim $2) (trim $3) (trim $4) (trim $5)
                (trim $6) (trim $7)
                (demonetize $8) (demonetize $9) (demonetize $10) (demonetize $11)
                (trim $12) (trim $13))))))

(transform-log "Friday, December 10, 2004"
               "/home/stranger/tax/text/20041215-21.txt")

(NOTE: Of course, I realize that there might be a problem with CLAWK
correctly matching REPORTED. However, this isn't where I currently have
an issue. Evaluating this code complains that the function NEXT is
undefined.)

I'm baffled about HOW to skip a line, a la AWK's next function.

From CLAWK's documentation at
http://www.geocities.com/mparker762/clawk.html#clawk I see that:

"DO-FILE-FIELDS is a macro that takes a filename, an optional
field-list, and a set of body forms. It opens the file in read mode,
and loops over the lines in the file splitting them into the specified
field variables (or the $n variables if no field list was given), and
executes the body forms in this new environment. Because of the
splitting overhead, it's not as snappy as FOR-FILE-LINES. Executing
(NEXT) within the body will restart the processing at the next input
line.

DO-STREAM-FIELDS is a closely related version that takes a stream (or t
for the standard input stream)."

Of course, the documentation is outdated. The source file, clawk.lisp,
shows that DO-STREAM-FIELDS does not exist, and FOR-STREAM-FIELDS
provides this functionality:

(defmacro for-stream-fields ((stream &optional fieldspec
                                              (strmvar (gensym))
                                              (linevar (gensym)))
                                     &rest body)
  (expand-for-stream-fields strmvar linevar fieldspec stream body))

(defun expand-for-stream-fields (strmvar linevar fieldspec stream body
                                         &key (curfile-name stream curfile-name-p)
                                         &aux (nextlbl (gensym)))
  `(let ((,strmvar ,stream))
     (if (eq ,strmvar 't)
         (setq ,strmvar *standard-input*))
     (unless (null ,strmvar)
       (let (,@(if curfile-name-p `((*CURFILE* ,curfile-name)))
             (*CURLINE* "")
             (*FNR* -1))
         (macrolet ((next () (list 'throw ',nextlbl nil)))
           (prog ,(if (eq linevar '*CURLINE*) nil (list linevar))
             ,nextlbl
                 (setq ,linevar (read-line ,strmvar nil :eof))
                 (unless (eq ,linevar :eof)
                    (setq *CURLINE* ,linevar
                          $0 ,linevar)
                    (incf *NR*)
                    (incf *FNR*)
                    (catch ',nextlbl
                      ,(expand-with-fields fieldspec linevar '*FS* body))
                    (go ,nextlbl))))))))

Of course, the CLAWK package does not export the NEXT symbol. Expanding
log2csv once gives the following:

(DEFUN LOG2CSV (&REST CLAWK::ARGS)
  (LET ((*CURFILE*)
        (*CURLINE* "")
        (*FS* CLAWK::+WS-FIELDSEP-PAT+)
        (*RSTART* 0)
        (*RLENGTH* 0)
        (*REND* 0)
        (*REGS* (MAKE-ARRAY 0))
        (*FIELDS* (MAKE-ARRAY 0))
        (*NR* 0)
        (*FNR* 0)
        (*NF* 0)
        (*SUBSEP* (STRING (CODE-CHAR 28)))
        (CLAWK::*OFS* " ")
        (CLAWK::*ORS* "n")
        (*LAST-SUCCESSFUL-MATCH*))
    (DECLARE
     (SPECIAL *CURFILE*
              *CURLINE*
              *FS*
              *RSTART*
              *RLENGTH*
              *REND*
              *REGS*
              *FIELDS*
              *NR*
              *FNR*
              *NF*
              *SUBSEP*
              CLAWK::*OFS*
              CLAWK::*ORS*
              *LAST-SUCCESSFUL-MATCH*))
    (FLET ((#:G1002 (#:G1003)
             (LET ((#:G1003 #:G1003))
               (IF (EQ #:G1003 'T) (SETQ #:G1003 *STANDARD-INPUT*))
               (UNLESS (NULL #:G1003)
                 (LET ((*CURLINE* "") (*FNR* -1))
                   (MACROLET ((CLAWK::NEXT ()
                                (LIST 'THROW '#:G1004 NIL)))
                     (PROG ()
                      #:G1004
                       (SETQ *CURLINE* (READ-LINE #:G1003 NIL :EOF))
                       (UNLESS (EQ *CURLINE* :EOF)
                         (SETQ *CURLINE* *CURLINE* $0 *CURLINE*)
                         (INCF *NR*)
                         (INCF *FNR*)
                         (CATCH '#:G1004
                           (LET (*FIELDS*)
                             (LET* ((#:G1005 (SPLIT *CURLINE* *FS*))
                                    (*NF* (LENGTH #:G1005)))
                               (WHEN (MATCH *CURLINE* '|Tax Log Report|)
                                 (NEXT))
                               (WHEN (MATCH *CURLINE* '|Total|) (NEXT))
                               (WHEN (MATCH *CURLINE* '|Due to rounding|)
                                 (NEXT))
                               (WHEN REPORTED (NEXT))
                               (WHEN (~ $1 "State") (NEXT))
                               (WHEN (~ $2 "JurisID") (NEXT))
                               (WHEN T
                                 (IF (~ (TRIM $2) "0")
                                     (DUMP REPORTED
                                           (TRIM $1)
                                           ""
                                           ""
                                           (TRIM $2)
                                           (TRIM $3)
                                           (TRIM $4)
                                           (TRIM $5)
                                           (DEMONETIZE $6)
                                           (DEMONETIZE $7)
                                           (DEMONETIZE $8)
                                           (DEMONETIZE $9)
                                           (TRIM $10)
                                           (TRIM $11))
                                     (DUMP REPORTED
                                           (TRIM $1)
                                           (TRIM $2)
                                           (TRIM $3)
                                           (TRIM $4)
                                           (TRIM $5)
                                           (TRIM $6)
                                           (TRIM $7)
                                           (DEMONETIZE $8)
                                           (DEMONETIZE $9)
                                           (DEMONETIZE $10)
                                           (DEMONETIZE $11)
                                           (TRIM $12)
                                           (TRIM $13)))))))
                         (GO #:G1004)))))))))
      (SETQ *FS* "\\t")
      (DOLIST (#:G1001 CLAWK::ARGS)
        (COND ((EQ #:G1001 'T) (#:G1002 *STANDARD-INPUT*))
              ((AND (STREAMP #:G1001) (INPUT-STREAM-P #:G1001))
               (#:G1002 #:G1001))
              ((OR (PATHNAMEP #:G1001) (STRINGP #:G1001))
               (LET ((*CURFILE* #:G1001))
                 (DECLARE (SPECIAL *CURFILE*))
                 (WITH-OPEN-FILE
                     (#:G1003 #:G1001
                              :DIRECTION
                              :INPUT
                              :ELEMENT-TYPE
                              'CHARACTER
                              :IF-DOES-NOT-EXIST
                              :ERROR)
                   (#:G1002 #:G1003)))))))))


Notice the (MACROLET ((CLAWK::NEXT () ...)

Changing calls to (NEXT) to (CLAWK:NEXT) complains that CLAWK doesn't
export the symbol, as you probably already knew.

Changing the package to export the symbol NEXT and recompiling doesn't
do any good, as it then complains that an uninterned symbol (#:G3, as I
recall) is undefined/unbound--I don't recall the exact error. If I use
(in-package :clawk) instead of (in-package :clawk-user) it gives me the
same error about the uninterned symbol.

Careful analysis of clawk.lisp reveals NO way to otherwise implement
(NEXT) outside of the package, as far as I can determine.

My question is this: Since DEFAWK expands the clauses within a call to
expand-for-stream-fields, shouldn't (NEXT) be available in the action
clauses wrapped within a DEFAWK?

Otherwise, can anyone determine how to implement AWK's next
functionality with CLAWK?
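
A note on the two errors described above. In the expansion, the local
macro is named CLAWK::NEXT, while a (NEXT) typed in CLAWK-USER reads as
a different symbol, which would explain the undefined-function
complaint. Separately, (LIST 'THROW '#:G1004 NIL) evaluates to
(THROW #:G1004 NIL), so the tag is evaluated as a variable instead of
being quoted, which matches the unbound uninterned symbol error. What
follows is a minimal sketch of the same catch/throw idea with the tag
quoted. DO-LINES-WITH-NEXT and KEEP-DATA-LINES are hypothetical names
used for illustration and are not part of CLAWK.

```lisp
;; Sketch only: a stand-alone loop macro, not CLAWK code.
(defmacro do-lines-with-next ((line lines) &body body)
  "Run BODY for each element of LINES; (NEXT) skips to the next element."
  (let ((tag (gensym "NEXT")))
    ;; The local macro expands to (THROW '<tag> NIL). Note the quote on
    ;; the tag: without it the gensym would be evaluated as a variable.
    `(macrolet ((next () '(throw ',tag nil)))
       (dolist (,line ,lines)
         (catch ',tag
           ,@body)))))

(defun keep-data-lines (lines)
  "Return LINES without any line containing the substring \"Total\"."
  (let ((kept '()))
    (do-lines-with-next (line lines)
      (when (search "Total" line)
        (next))                       ; abandon this line, like AWK's next
      (push line kept))
    (nreverse kept)))
```

Because the macro and its caller live in the same package here, NEXT is
the same symbol in both places. With a library macro, the library would
also have to export NEXT (or the caller would have to qualify it).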

From: Jack
Subject: Re: CLAWK Confusion
Date: Fri, 10 Mar 2006 22:05:24 +0000
Message-ID: <1142028324.788578.157550@e56g2000cwe.googlegroups.com>

;My question is this: Since DEFAWK expands the clauses within a call to
;expand-for-stream-fields, shouldn't (NEXT) be available in the action
;clauses wrapped within a DEFAWK?
;
;Otherwise, can anyone determine how to implement AWK's next
;functionality with CLAWK?

I decided that it would require a workaround, and I cobbled together
something that worked. The workaround isn't very lispy, but I didn't
have the ambition to patch clawk.lisp.

Is it an SBCL bug that prevents the usage of (NEXT) from
(EXPAND-FOR-STREAM-FIELDS), or is it simply a mis-feature of CLAWK? I'm
still curious about the original author's design, so if anyone knows
how to do it using the CLAWK API then please do tell.

From: Michael Parker
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 01:10:49 +0000
Message-ID: <100320061927016444%michaelparker@earthlink.net>

In article <························@e56g2000cwe.googlegroups.com>,
Jack <············@msn.com> wrote:

> ;My question is this: Since DEFAWK expands the clauses within a call to
> ;expand-for-stream-fields, shouldn't (NEXT) be available in the action
> ;clauses wrapped within a DEFAWK?
> ;
> ;Otherwise, can anyone determine how to implement AWK's next
> ;functionality with CLAWK?
> 
> I decided that it would require a workaround, and I cobbled together
> something that worked. The workaround isn't very lispy, but I didn't
> have the ambition to patch clawk.lisp.
> 
> Is it an SBCL bug that prevents the usage of (NEXT) from
> (EXPAND-FOR-STREAM-FIELDS), or is it simply a mis-feature of CLAWK? I'm
> still curious about the original author's design, so if anyone knows
> how to do it using the CLAWK API then please do tell.
> 

This is a bug in the current version of CLAWK.  I ran into it myself
yesterday and fixed it, but haven't yet uploaded the new version.  I'll
try to get it up tomorrow.

From: Jack
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 07:22:56 +0000
Message-ID: <1142061776.132132.173580@i40g2000cwc.googlegroups.com>

Michael Parker wrote:
> This is a bug in the current version of CLAWK.  I ran into it myself
> yesterday and fixed it, but haven't yet uploaded the new version.  I'll
> try to get it up tomorrow.

Thanks for the 411. The logic did seem a bit hairy, as the catch thrown
by (NEXT) falls below where it reads the next line, and the label is a
gensym. I wasn't sure if the (NEXT) macro should work anyway. When I
tried to use CLISP with a library that uses a toplevel progn I ran into
problems with unbound uninterned symbols, and I considered that this
problem might be a similar bug in SBCL. It'll be interesting to see how
you managed to fix it.

From: Michael Parker
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 15:01:09 +0000
Message-ID: <110320060917237935%michaelparker@earthlink.net>

In article <························@i40g2000cwc.googlegroups.com>,
Jack <············@msn.com> wrote:

> Michael Parker wrote:
> > This is a bug in the current version of CLAWK.  I ran into it myself
> > yesterday and fixed it, but haven't yet uploaded the new version.  I'll
> > try to get it up tomorrow.
> 
> Thanks for the 411. The logic did seem a bit hairy, as the catch thrown
> by (NEXT) falls below where it reads the next line, and the label is a
> gensym. I wasn't sure if the (NEXT) macro should work anyway. When I
> tried to use CLISP with a library that uses a toplevel progn I ran into
> problems with unbound uninterned symbols, and I considered that this
> problem might be a similar bug in SBCL. It'll be interesting to see how
> you managed to fix it.
> 

I think I just had to change the macrolet to a flet, but the fix for
the (next) bug is mixed up in a bunch of other changes so I'm not
certain. (I was trying to parse a file that was a mixture of English
prose, tab-delimited fields, fixed-width fields, and C++ typedef,
class, and struct declarations.  The parser is working, but it took
clawk, deflexer, and lispworks' handy defparser to get it whipped :-/ )
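
A sketch of what a FLET-based NEXT might look like, assuming the change
described above is essentially the whole fix (the post itself is not
certain of that). WITH-NEXT-FUNCTION and STRIP-COMMENTS are hypothetical
names, not CLAWK code. A local function has no expansion-time quoting to
get wrong: the THROW form is ordinary code with a quoted tag.

```lisp
;; Sketch only: NEXT as a local function instead of a local macro.
(defmacro with-next-function ((line lines) &body body)
  "Run BODY for each element of LINES with a local function NEXT in scope."
  (let ((tag (gensym "NEXT-RECORD")))
    `(dolist (,line ,lines)
       (flet ((next () (throw ',tag nil)))
         ;; Silence "unused function" warnings when BODY never calls NEXT.
         (declare (ignorable #'next))
         (catch ',tag
           ,@body)))))

(defun strip-comments (lines)
  "Return LINES without the ones that start with a # character."
  (let ((kept '()))
    (with-next-function (line lines)
      (when (and (plusp (length line))
                 (char= (char line 0) #\#))
        (next))
      (push line kept))
    (nreverse kept)))
```

The caller still has to be able to see the same NEXT symbol, so in a
library the symbol would need to be exported either way.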

From: Kaz Kylheku
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 00:15:33 +0000
Message-ID: <1142036133.866117.118710@z34g2000cwc.googlegroups.com>

Jack wrote:
> I'd be grateful if anyone can help me with this problem.
>
> I have a working AWK script that I want to convert to CLAWK. The AWK
> script is as follows:

Awk? That's like Perl's dumber grand uncle.

> (in-package :clawk-user)

Try using CL-PPCRE and LOOP:

  (with-open-file (f ...)
    (loop
      for line = (read-line f nil nil)
      while line
      do ...   bunch of regular expression matches with CL-PPCRE ...
      ))

CL-PPCRE has binding macros for "destructuring" strings with regular
expressions.

Awk's field-splitting can be done using CL-PPCRE::SPLIT. You can do
DESTRUCTURING-BIND on the resulting list to get it into variables.
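
To make the skeleton above concrete, here is a rough sketch using only
standard Common Lisp. SEARCH and a hand-written tab splitter stand in
for CL-PPCRE's regex matching and splitting so that it runs without
extra libraries. SPLIT-ON-TAB, SKIP-LINE-P, and PROCESS-STREAM are
hypothetical names, and the output format is made up for the example.

```lisp
;; Sketch only: plain-CL stand-ins for the regex calls in the skeleton.
(defun split-on-tab (line)
  "Split LINE at every tab character, keeping empty fields."
  (loop with start = 0
        for pos = (position #\Tab line :start start)
        collect (subseq line start pos)
        while pos
        do (setf start (1+ pos))))

(defun skip-line-p (line)
  "True for header and footer lines that should be dropped."
  (some (lambda (needle) (search needle line))
        '("Tax Log Report" "Total" "Due to rounding")))

(defun process-stream (in out)
  "Copy data lines from IN to OUT, showing the first two fields and a count."
  (loop for line = (read-line in nil nil)
        while line
        unless (skip-line-p line)
          do (destructuring-bind (&optional (state "") (juris "") &rest others)
                 (split-on-tab line)
               (format out "~A|~A|~D~%" state juris (length others)))))
```

Skipping a line is just the UNLESS clause here, so no NEXT operator is
needed; the price is that every filter has to be folded into one
predicate.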

From: Jack
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 07:58:19 +0000
Message-ID: <1142063898.973864.34290@p10g2000cwp.googlegroups.com>

> Awk? That's like Perl's dumber grand uncle.

Each tool has its place. I hadn't previously used AWK, but it appeared
to be the best tool for the job. Unfortunately, it doesn't seem to
scale well. I soon realized that because I need to process multiple
files with different dates I would run into the problem of maintaining
multiple scripts or one very ugly, difficult-to-maintain script, and
this project involves enough headaches without adding more problems to
it.

> Try using CL-PPCRE and LOOP:

>  (with-open-file (f ...)
>    (loop
>      for line = (read-line f nil nil)
>      while line
>      do ...   bunch of regular expression matches with CL-PPCRE ...
>      ))

>CL-PPCRE has binding macros for "destructuring" strings with regular
>expressions.

> Awk's field-splitting can be done using CL-PPCRE::SPLIT. You can do
> DESTRUCTURING-BIND on the resulting list to get it into variables.

I'm not very familiar with CL-PPCRE. Since I had written the logic for
the awk script and it worked, I considered that it would be expedient
to use an API for which the translation wouldn't require rewriting the
logic. Whereas CL-PPCRE might fit the solution, using it would change
the shape of that solution. Given your proposal, I'd say that CLAWK
offers a cleaner solution. With GAWK or CLAWK I can go straight to the
task of writing the text processing and filtering, without wrangling
with LOOP or even opening the file. If not for the (NEXT) bug in CLAWK,
you probably wouldn't have heard a peep from me. The translation was
otherwise almost completely lateral.

On the surface, your solution looks slightly more complicated than my
solution. Perhaps more knowledge about CL-PPCRE would sway me. Instead
of a skeleton, show me a translation of my AWK script using CL-PPCRE.
Would it provoke visions of maintenance nightmares? Would it start to
look like perl? Would it give me a headache just looking at it? If it
has more appeal than the CLAWK solution, I might consider it.
Unfortunately, I don't know enough about CL-PPCRE to enable me to
visualize doing it with CL-PPCRE and LOOP.

Thanks for the suggestion.

From: Michael Parker
Subject: Re: CLAWK Confusion
Date: Sat, 11 Mar 2006 15:25:39 +0000
Message-ID: <110320060941526117%michaelparker@earthlink.net>

In article <·······················@p10g2000cwp.googlegroups.com>, Jack
<············@msn.com> wrote:

> > Awk? That's like Perl's dumber grand uncle.
> 
> Each tool has its place. I hadn't previously used AWK, but it appeared
> to be the best tool for the job. Unfortunately, it doesn't seem to
> scale well. I soon realized that because I need to process multiple
> files with different dates I would run into the problem of maintaining
> multiple scripts or one very ugly, difficult-to-maintain script, and
> this project involves enough headaches without adding more problems to
> it.

I've never been a big fan of perl, partly because I had done some major
AWK work (10k line CBASIC->MS Pro Basic translator) before perl came
around, so I knew AWK inside and out, and knew that AWK code was
surprisingly maintainable.  That was actually one of the things that
made me write clawk (well, that and the lack of AWK on the lisp
machines) -- I wanted real data structures under me, not just those
weird string->string associative arrays, and I wanted better regex
support.  AWK regexes compile to state machines (at least the original
AWK regexes did) so you couldn't get submatch registers, which meant
that you wound up messing around with match() too much for my tastes.
Once I had written the regex matcher, doing an awk-like system on top
of it was trivial, as was the lexer-generator.  Combine those three
with a parser-generator and the data-structure handling capabilities of
lisp, and you've got one heck of a text-processing chainsaw.

> 
> > Try using CL-PPCRE and LOOP:
> 
> >  (with-open-file (f ...)
> >    (loop
> >      for line = (read-line f nil nil)
> >      while line
> >      do ...   bunch of regular expression matches with CL-PPCRE ...
> >      ))
> 
> >CL-PPCRE has binding macros for "destructuring" strings with regular
> >expressions.
> 
> > Awk's field-splitting can be done using CL-PPCRE::SPLIT. You can do
> > DESTRUCTURING-BIND on the resulting list to get it into variables.
> 

Or you can use CLAWK's SPLIT.  Or WITH-FIELDS, which both splits and 
destructures.  Or you can use WITH-MATCH, which applies a regex and
destructures the submatches into variables for you.  Or MATCH-CASE,
which looks sort of like CL CASE but uses regex matching instead of
eq/memq, and allows you to destructure the register submatches within a
particular case using WITH-SUBMATCHES.

From: Jack
Subject: Re: CLAWK Confusion
Date: Sun, 12 Mar 2006 12:44:51 +0000
Message-ID: <1142167491.496883.66660@u72g2000cwu.googlegroups.com>

Michael Parker wrote:
> I've never been a big fan of perl, partly because I had done some major
> AWK work (10k line CBASIC->MS Pro Basic translator) before perl came
> around, so I knew AWK inside and out, and knew that AWK code was
> surprisingly maintainable.  That was actually one of the things that
> made me write clawk (well, that and the lack of AWK on the lisp
> machines) -- I wanted real data structures under me, not just those
> weird string->string associative arrays, and I wanted better regex
> support.  AWK regexes compile to state machines (at least the original
> AWK regexes did) so you couldn't get submatch registers, which meant
> that you wound up messing around with match() too much for my tastes.
> Once I had written the regex matcher, doing an awk-like system on top
> of it was trivial, as was the lexer-generator.  Combine those three
> with a parser-generator and the data-structure handling capabilities of
> lisp, and you've got one heck of a text-processing chainsaw.

Kudos to you for the CLAWK project... LISP and AWK make quite the
powerful combination. In less than 100 lines of LISP, I
converted two dozen files containing a combined total of somewhere
around 70k records, from tab-delimited to CSV, filtering out headers,
footers and other miscellaneous cruft and adding a date column, for the
purpose of preparing the data for a database. It would have required
less code, had the input not needed some tweaking due to missing tabs.
(The tabs were missing in the original RTF files, which I converted to
plain tab-delimited text with OpenOffice.) I could have done without
the trim function, too, since LISP has an analogous function, but I
wasn't thinking when I converted the original AWK script.

Wrapping calls to (TRANSFORM-LOG) with a redirection of
*STANDARD-OUTPUT* in a (WITH-OPEN-FILE) (within a (LAMBDA) passed to a
(MAPCAR), given a list of tax logs) after replacing the T stream
designator with *STANDARD-OUTPUT* in the call to (FORMAT), made it even
sweeter. The information compression is amazing, considering what it
might have taken with C or (ACK!) Java.
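
The arrangement described above might look roughly like this. It is a
sketch under assumptions: CALL-WITH-OUTPUT-FILE and CONVERT-ALL are
hypothetical names, and each job is simply an output path paired with a
function of no arguments that prints to *STANDARD-OUTPUT*.

```lisp
;; Sketch only: rebind *STANDARD-OUTPUT* to a file around each job.
(defun call-with-output-file (path thunk)
  "Call THUNK with *STANDARD-OUTPUT* bound to a fresh file at PATH."
  (with-open-file (*standard-output* path
                                     :direction :output
                                     :if-exists :supersede
                                     :if-does-not-exist :create)
    (funcall thunk)))

(defun convert-all (jobs)
  "JOBS is a list of (output-path . thunk) pairs. Returns the output paths."
  (mapcar (lambda (job)
            (call-with-output-file (car job) (cdr job))
            (car job))
          jobs))
```

Since *STANDARD-OUTPUT* is a special variable, the binding made by
WITH-OPEN-FILE is dynamic, so any printing done while the thunk runs
lands in the file without the printing code knowing about it.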

I'm impressed with how easy LISP and CLAWK made the job, and I thank
you for the CLAWK effort.

>
> Or you can use CLAWK's SPLIT.  Or WITH-FIELDS, which both splits and
> destructures.  Or you can use WITH-MATCH, which applies a regex and
> destructures the submatches into variables for you.  Or MATCH-CASE,
> which looks sort of like CL CASE but uses regex matching instead of
> eq/memq, and allows you to destructure the register submatches within a
> particular case using WITH-SUBMATCHES.

Though it wasn't absolutely necessary, I did modify the DEFAWK
definition to use WITH-FIELDS in the T action form, to make the code
more readable. While it added a few lines to the code, I compensated
for it by moving the (TRIM) and (DEMONETIZE) calls to (DUMP).
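
For readers following along, a rough sketch of helpers in that spirit,
using only standard Common Lisp. This is not the code from the actual
script: the argument layout of DUMP (a date, a list of text fields, a
list of money fields) and the CSV-LINE helper are assumptions made for
the example.

```lisp
;; Sketch only: standard-CL versions of the AWK helper functions.
(defun trim (str)
  "Strip leading and trailing spaces and tabs, like the AWK trim()."
  (string-trim '(#\Space #\Tab) str))

(defun demonetize (str)
  "Trim STR, drop the first dollar sign and every comma."
  ;; :COUNT 1 mirrors AWK's sub(); the comma removal mirrors gsub().
  (remove #\, (remove #\$ (trim str) :count 1)))

(defun csv-line (fields)
  "Join FIELDS into one line, each wrapped in double quotes."
  (format nil "~{\"~A\"~^,~}" fields))

(defun dump (stream date text-fields money-fields)
  "Write one CSV record to STREAM, cleaning the fields on the way out."
  (write-line (csv-line (append (list date)
                                (mapcar #'trim text-fields)
                                (mapcar #'demonetize money-fields)))
              stream))
```

Doing the cleanup inside DUMP keeps the call sites down to plain field
lists, which is the readability gain described above.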