From: robi
Subject: Parsing a odd log file format
Date: 
Message-ID: <1171921638.141348.25750@v45g2000cwv.googlegroups.com>
List,
I am trying to learn Common LISP (I used LISP and Scheme at
university about 18 years ago), and the way I normally learn a
language is to create some little project and hack away at it.  My
little project is to create a log parser in Lisp (I am using the Allegro
IDE for development on Windows XP) and from there provide some
utilities for finding specific information in the logs.  The
particular logs I am working with are from an application server, and
they are in an odd format that is kind of like a CSV file,
except for the Java stack trace, which is not comma separated, not
quoted, and has each line separated by a hard return.
Here is an example:
"Error","jrpp-2733","01/10/07","03:04:00",,"Cannot find key 2jscript/
omniture.js?20061208 in struct.The specified key, 2jscript/omniture.js?
20061208, does not exist in the structure. The specific sequence of
files included or processed is: D:\WebData\www.underarmour.com\public
\Home.cfm "
coldfusion.runtime.IllegalStructAccessException: Cannot find key
2jscript/omniture.js?20061208 in struct.
	at coldfusion.runtime.Struct.StructFind(Struct.java:199)
	at coldfusion.runtime.CFPage.StructFind(CFPage.java:4345)
	at
cfApplicationManager2ecfc1600110577$funcINITIALIZEAPP.runFunction(D:
\WebData\com\underarmour\framework\ApplicationManager.cfc:48)
	at coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:348)
	at coldfusion.runtime.UDFMethod
$ArgumentCollectionFilter.invoke(UDFMethod.java:258)
	at
coldfusion.filter.FunctionAccessFilter.invoke(FunctionAccessFilter.java:
56)
	at coldfusion.runtime.UDFMethod.runFilterChain(UDFMethod.java:211)
	at coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:370)

I have found some LISP code as a starting point to parse the files but
it is for CSV files.  Would anyone mind assisting me in a strategy for
parsing my odd format?

Here is the LISP code I am using now:




;;; PARSE-CSV-LINE -- Parse one CSV line into a list of fields,
;;; stripping quotes and field-internal escape characters.
;;; Simple FSM with states '(:NORMAL :QUOTED :ESCAPED :QUOTED+ESCAPED).
(defun parse-csv-line (line)
  (when (string= line "")
    (return-from parse-csv-line '()))
  ;; assert: line contains at least one field
  (loop for c across line
        with state = :normal
        and results = '()
        and chars = '()
    do (ecase state
         ((:normal)
          (case c
            ((#\") (setq state :quoted))
            ((#\\) (setq state :escaped))
            ((#\,)
             (push (coerce (nreverse chars) 'string) results)
             (setq chars '()))
            (t (push c chars))))
         ((:quoted)
          (case c
            ((#\") (setq state :normal))
            ((#\\) (setq state :quoted+escaped))
            (t (push c chars))))
         ((:escaped) (push c chars) (setq state :normal))
         ((:quoted+escaped) (push c chars) (setq state :quoted)))
    finally
     (progn
       (push (coerce (nreverse chars) 'string) results) ; close open field
       (return (nreverse results)))))

;;; sample driver
(defun parse-csv-file (filename)
  (with-open-file (s filename)
         (loop for line = (read-line s nil nil)
               while line
           collect (parse-csv-line line))))


Thanks,

Robi

From: ·······@gmail.com
Subject: Re: Parsing a odd log file format
Date: 
Message-ID: <1171926931.917837.6460@t69g2000cwt.googlegroups.com>
On Feb 19, 10:47 pm, "robi" <·······@gmail.com> wrote:
> My little project is to create a Log parser in Lisp (I am using
> Allegro IDE for development on Windows XP) and from their
> provide some utilities for finding specific information in the
> logs.

Have you looked at this one?

  http://www.hpc.unm.edu/~download/LoGS/

Cheers,
Edi.
From: robi
Subject: Re: Parsing a odd log file format
Date: 
Message-ID: <1171928592.603788.58810@h3g2000cwc.googlegroups.com>
On Feb 19, 3:15 pm, ·······@gmail.com wrote:
> On Feb 19, 10:47 pm, "robi" <·······@gmail.com> wrote:
>
> > My little project is to create a Log parser in Lisp (I am using
> > Allegro IDE for development on Windows XP) and from their
> > provide some utilities for finding specific information in the
> > logs.
>
> Have you looked at this one?
>
>  http://www.hpc.unm.edu/~download/LoGS/
>
> Cheers,
> Edi.

That looks fantastic and I will most likely use that but I would still
like to understand how to deal with the part of my log file that is
not quoted, does not have a comma, etc for my own edification.  Part
of my goal is to understand the language not just parse logs :-)

Thanks very much though.  I am going to spend some time reading and
messing with LoGS
From: Pascal Bourguignon
Subject: Re: Parsing a odd log file format
Date: 
Message-ID: <87lkit66xz.fsf@thalassa.informatimago.com>
"robi" <·······@gmail.com> writes:
> That looks fantastic and I will most likely use that but I would still
> like to understand how to deal with the part of my log file that is
> not quoted, does not have a comma, etc for my own edification.  Part
> of my goal is to understand the language not just parse logs :-)

This is an instance of a general problem, learning the grammar of an
unknown language, that you have solved yourself at least once, so you
should be able to solve it again.  You did it when you learned English,
perhaps as a baby.  Why couldn't you do it again now?

You could use one of the learning algorithms you can find in AI books
or papers to automatically find a grammar to parse these files, but
since it looks so simple, here is how you can do it:

Parse all the lines as csv, and consider the ones where you don't get
the expected count of fields to be stack frames.  You can collect
sequences of stack frames and attach them to the previous (or
following) csv line.
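
As a rough sketch of that strategy, attaching frames to the previous
csv line: GROUP-ENTRIES and its CSV-LINE-P argument are names invented
for this illustration, and the predicate is left open on purpose (it
could check the field count returned by a csv parser, or just count
commas).

```lisp
;; Sketch only: GROUP-ENTRIES and CSV-LINE-P are hypothetical names.
(defun group-entries (lines csv-line-p)
  "Group LINES into entries of the form (CSV-LINE FRAME-LINE ...).
CSV-LINE-P is a predicate of one line that recognizes csv lines.
Lines that appear before the first csv line are dropped."
  (let ((entries '())
        (current nil))
    (dolist (line lines)
      (cond ((funcall csv-line-p line)
             ;; A csv line closes the open entry and starts a new one.
             (when current (push current entries))
             (setf current (list line)))
            (current
             ;; Any other line is a frame of the open entry.
             (push line current))))
    (when current (push current entries))
    ;; Entries and their lines were accumulated in reverse order.
    (nreverse (mapcar #'reverse entries))))
```

For instance, (group-entries lines (lambda (l) (<= 5 (count #\, l))))
would use the comma count as the test.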

So the grammar is:

<file>   ::= | <entry> <file> .
<entry>  ::= <csv> <frames> .
<frames> ::= | <frame> <frames> .
<csv>    ::= <field> "," <field> "," <field> "," <field> "," <field> .
             --  for example, if the csv records have 5 fields.
<frame>  ::= <any line that is not a csv line> .


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
You're always typing.
Well, let's see you ignore my
sitting on your hands.
From: robi
Subject: Re: Parsing a odd log file format
Date: 
Message-ID: <1171995851.974884.129080@j27g2000cwj.googlegroups.com>
Pascal,

I think I miscommunicated what I was looking for.  I am not looking
for strategic help but rather help on a tactical level (i.e. syntax).
So since I am just learning the language again after not touching it
for two decades I am having a hard time figuring out how to look for
specific patterns.  If you have a good resource that shows various
methods of matching patterns in LISP that would be more helpful to me.

Many thanks though for your reply.

Robi

From: Pascal Bourguignon
Subject: Re: Parsing a odd log file format
Date: 
Message-ID: <87tzxg30gr.fsf@thalassa.informatimago.com>
"robi" <·······@gmail.com> writes:

> Pascal,
>
> I think I miscommunicated what I was looking for.  I am not looking
> for strategic help but rather help on a tactical level (i.e. syntax).
> So since I am just learning the language again after not touching it
> for two decades I am having a hard time figuring out how to look for
> specific patterns.  If you have a good resource that shows various
> methods of matching patterns in LISP that would be more helpful to me.

Oh.  Well, my point is that you can just come up with any ad-hoc method
that does what you want.

Copying here your example, but removing some newlines I think were
added by your newsgroup software:

Here is an example:
"Error","jrpp-2733","01/10/07","03:04:00",,"Cannot find key 2jscript/omniture.js?20061208 in struct.The specified key, 2jscript/omniture.js?20061208, does not exist in the structure. The specific sequence of files included or processed is: D:\WebData\www.underarmour.com\public\Home.cfm "
coldfusion.runtime.IllegalStructAccessException: Cannot find key 2jscript/omniture.js?20061208 in struct.
	at coldfusion.runtime.Struct.StructFind(Struct.java:199)
	at coldfusion.runtime.CFPage.StructFind(CFPage.java:4345)
	at cfApplicationManager2ecfc1600110577$funcINITIALIZEAPP.runFunction(D:\WebData\com\underarmour\framework\ApplicationManager.cfc:48)
	at coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:348)
	at coldfusion.runtime.UDFMethod$ArgumentCollectionFilter.invoke(UDFMethod.java:258)
	at coldfusion.filter.FunctionAccessFilter.invoke(FunctionAccessFilter.java:56)
	at coldfusion.runtime.UDFMethod.runFilterChain(UDFMethod.java:211)
	at coldfusion.runtime.UDFMethod.invoke(UDFMethod.java:370)


We notice that the csv line seems to have 6 fields, that is, at least
five commas.  So you could just count them:

;; pseudo-code
(if (<= 5 (count #\, line))
    (parse-csv line)
    (collect-stack-frame line))


Another thing we notice, is that actually, the stack-frame seems to
have only the first line starting in column 1, so you could easily
collect the stack frame with:

(defun collect-stack-frame (stream first-line)
  ;; Continuation lines start with whitespace (a tab in the sample log).
  (loop :with stack = (list first-line)
        :for next-line = (read-line stream nil nil)
        :while (and next-line
                    (plusp (length next-line))
                    (member (char next-line 0) '(#\Space #\Tab)))
        :do (push next-line stack)
        ;; We also return the last line read, which will probably be a
        ;; csv line for the next record (or NIL at end of file).
        :finally (return (values (nreverse stack) next-line))))


I guess the "tactic" used here to come up with a test is to find
differences between the different kinds of input.  In your example,
there are three different kinds of input:

"...","...","...","...","...","..."...

blah blah blah

    at blah blab blah


We could just test for a double quote in the first column, whitespace
in the first column, or anything else, to distinguish them.  That would
be the fastest way to determine what kind of line we have:

(case (aref line 0)
 ((#\")           (parse-csv                line))
 ((#\space #\tab) (collect-stack-frame-at   line))
 (otherwise       (collect-stack-frame-head line)))
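
A slightly more defensive variant of that first-character test, as a
sketch only: LINE-KIND and the keywords it returns are names made up
here.  It only classifies the line, and it checks for an empty line
first, since (aref line 0) would signal an error on one.

```lisp
;; Sketch only: LINE-KIND and its result keywords are hypothetical.
(defun line-kind (line)
  "Classify LINE as :BLANK, :CSV, :FRAME-AT or :FRAME-HEAD
by looking at its first character only."
  (cond ((zerop (length line)) :blank)
        ((char= (char line 0) #\") :csv)
        ;; The sample log indents the \"at ...\" lines with a tab.
        ((member (char line 0) '(#\Space #\Tab)) :frame-at)
        (t :frame-head)))
```

The caller can then dispatch on the result with CASE, which keeps the
classification separate from whatever is done with each kind of line.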

But we know that in CSV files the double quotes are optional, and that
in log files lines are usually structured as records with a fixed (or at
least minimal) number of fields, so the commas are more significant.



There's a risk of getting an initial stack-frame line that contains
several commas, perhaps more than five:

coldfusion.runtime.IllegalStructAccessException: Cannot "a,b,c,d,e,f,g" is not an integer.

So we could test for ^[a-zA-Z0-9_.]*: rather than just (not #\space).
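
That particular pattern does not need a regexp package.  Here is a
minimal sketch with POSITION and EVERY; FRAME-HEAD-P is a name invented
for this example, and note that ALPHANUMERICP may also accept
non-ASCII letters and digits, so it is slightly wider than [a-zA-Z0-9].

```lisp
;; Sketch only: FRAME-HEAD-P is a hypothetical name.
(defun frame-head-p (line)
  "True if LINE starts with a (possibly empty) run of letters, digits,
underscores and dots immediately followed by a colon."
  (let ((colon (position #\: line)))
    (and colon
         ;; Everything before the first colon must be in the class.
         (every (lambda (c)
                  (or (alphanumericp c) (char= c #\_) (char= c #\.)))
                (subseq line 0 colon)))))
```

A csv line from the sample fails this test even though it contains
colons, because the text before its first colon includes double quotes
and commas.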




What I was saying in the previous post was to use the csv parser,
assuming it would return a one-field record on lines with no comma (or
at least, a record with fewer than six fields).  The specific pattern
of a csv line is then recognized by the csv parser itself.

You'd then write:

;; pseudo-code
(let ((record (parse-csv line)))
   (if (<= 6 (length (record-fields record)))
       (got-a-csv record)
       (collect-stack-frame line)))



Otherwise, you may want to try to detect whether it looks like a csv
line, before calling the csv parser, with a regexp.

There are several regexp packages available.  In clisp, we have access
to the standard regex(3) library through the REGEXP package.  CL-PPCRE is
highly regarded too, but I've never used it, regex(3) being enough for
my needs.  http://www.cliki.net/regular%20expression

So we could write:

;; pseudo-code
(if #+clisp(regexp:match ".*,.*,.*,.*,.*,.*" line)
    #-clisp(error "Please fetch a regular expression package and implement this")
    (parse-csv line)
    (collect-stack-frame line))

Note that a regular expression such as that one ".*,.*,.*,.*,.*,.*"
could be compiled by the regular expression engine to something that
is essentially equivalent to (<= 5 (count #\, line)) [well, plus a
shortcut to stop counting after 5 commas are found].
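
To make the "shortcut" concrete, here is a small sketch that stops
scanning as soon as enough commas have been seen.  AT-LEAST-N-COMMAS-P
is a name invented for this example; it should agree with
(<= n (count #\, line)) on every line.

```lisp
;; Sketch only: AT-LEAST-N-COMMAS-P is a hypothetical name.
(defun at-least-n-commas-p (n line)
  "True if LINE contains at least N commas.
Scanning stops at the Nth comma instead of reading the whole line."
  (let ((start 0))
    (dotimes (i n t)
      (declare (ignorable i))
      (let ((pos (position #\, line :start start)))
        (if pos
            (setf start (1+ pos))   ; continue after this comma
            (return nil))))))       ; ran out of commas early
```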





So matching patterns in strings can be done with various kinds of
recognizers; in simple cases you can write an ad-hoc one with COUNT,
POSITION, SEARCH, etc.  For more complex patterns, or for ease of
specifying them, you can use regular expressions.  Or you can just use
a parser for a full context-free grammar: if the parser can parse the
text without error, you've recognized the "pattern" of a word of that
language; if an error occurs, then you've not.  [The CSV format
doesn't need a context-free grammar, a regular expression is enough;
but the point is that you can just delegate the pattern matching to a
parsing function].
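
To give an idea of the ad-hoc style, here are two tiny recognizers
built only from standard sequence functions.  FRAME-LOCATION and
MENTIONS-P are names invented for this sketch, not part of any library.

```lisp
;; Sketch only: FRAME-LOCATION and MENTIONS-P are hypothetical names.
(defun frame-location (line)
  "Return the text between the last \"(\" in LINE and the next \")\",
or NIL if LINE has no such pair."
  (let* ((open (position #\( line :from-end t))
         (close (and open (position #\) line :start open))))
    (and close (subseq line (1+ open) close))))

(defun mentions-p (needle line)
  "Return the index where the string NEEDLE first occurs in LINE, or NIL."
  (search needle line))
```

On a frame line such as the StructFind one above, FRAME-LOCATION
extracts the source file and line number part, and
(mentions-p "coldfusion." line) tells you whether the frame belongs to
the application server itself.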


When you write an ad-hoc recognizer, you are actually implementing a
specific parser for some language.  Hopefully, this language is a
superset of the input language: you may accept illegal strings, that
wouldn't occur in the input, but you should "classify" all input
strings into the correct category (parse it as the correct
non-terminal).  







There is also what is called "pattern matching" in general in lisp,
which applies not only to strings, but in general to lists, vectors,
structures, arrays, any type of object.
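
The simplest standard example of that is DESTRUCTURING-BIND, which
matches a list against a fixed shape and signals an error if the shape
does not fit.  In this sketch RECORD-SUMMARY and the six field names
are assumptions made for illustration; the log itself does not say
what its fields are called.

```lisp
;; Sketch only: RECORD-SUMMARY and the field names are assumptions.
(defun record-summary (fields)
  "FIELDS is a list of exactly six strings, as a csv line parser
might return for one record of the sample log."
  (destructuring-bind (severity thread date time application message)
      fields
    (declare (ignore application))
    (format nil "~A ~A [~A/~A] ~A" date time severity thread message)))
```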

> Many thanks though for your reply.

Hope this helps.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Kitty like plastic.
Confuses for litter box.
Don't leave tarp around.