Hi, I have just started looking into series, making a small system for
analyzing debug-logs, and I have come up with a few questions:
The first thing I'm doing is reading lines from an access.log with
scan-file and parsing these into clos-objects:
(defun make-access-series (path)
  (choose-if (complement #'null)
             (map-fn t #'access-parse
                     (scan-file path #'read-line))))
access-parse parses a line, does some simple filtering (to avoid
making unnecessary objects) and returns an access-log object.
To test this, I typically do something like:
(subseries (make-access-series #P"log:access.log") 0 2)
Giving me the first two parsed access-log objects.
1. Series are supposed to be lazy, which I thought meant that in this
example only the first two lines would need to be read (given that
the access-parse function accepts these), but running this code
leads to heavy disk activity. Have I misunderstood "lazy", or is
this implementation-dependent? (I'm using series from CLOCC on
clisp)
2. Once the parsing is debugged, the next step is to include this in
higher-order functions. This means both a development phase, where
I only need a small set of data (the same data over and over
will do nicely), and doing the full analysis, where I'll probably
want to run the same set of data through different aggregations.
In both these cases the reading and parsing really only needs to be
done once, and I wonder: how can I add caching to a given stage in
this analysis, and upon re-read of the series, read from the cache
instead of re-doing the read-parse? If this was a shell-script, I
would do:
sed parselog.sed < logfile | tee log.tmp1 | analysis-1
and later on:
analysis-2 < log.tmp1
In other words: how can I make "tee" with a series?
(I imagine the CL solution can be a lot smarter than tee, having
implicit caching without the caller knowing, and automatically
continuing to read the original series once past the length of the
cache)
3. Some log-files have entries spanning several lines. Is it possible
to make a function collecting a number of entries from an
input-series and having the result returned as a new series? Like
chunk, but the number of entries in each chunk is not known in
advance.
regards, rolf rander
--
http://www.pvv.org/~rolfn/
"He who takes jest only as jest and seriousness only seriously has
in fact grasped both of them badly" -- Piet Hein
>>>>> "Rolf" == Rolf Rander Næss <··········@pvv.org> writes:
Rolf> Hi, I have just started looking into series, making a small system for
Rolf> analyzing debug-logs, and I have come up with a few questions:
Rolf> The first thing I'm doing is reading lines from an access.log with
Rolf> scan-file and parsing these into clos-objects:
Rolf> (defun make-access-series (path)
Rolf>   (choose-if (complement #'null)
Rolf>              (map-fn t #'access-parse
Rolf>                      (scan-file path #'read-line))))
Rolf> access-parse parses a line, does some simple filtering (to avoid
Rolf> making unnecessary objects) and returns an access-log object.
Rolf> To test this, I typically do something like:
Rolf> (subseries (make-access-series #P"log:access.log") 0 2)
Don't you need to declare make-access-series to be a series function?
I think if you don't, make-access-series will read every single line
and create a series (from choose-if). This then gets passed to
subseries.
If you declare make-access-series to be a series function, then series
can optimize it and then only the first 2 lines will be read.
Ray
··········@pvv.org (Rolf Rander Næss) writes:
>(I'm using series from CLOCC on clisp)
No, of course not. Series comes from: http://series.sourceforge.net/
rolf rander
Rolf Rander Næss wrote:
> Hi, I have just started looking into series, making a small system for
> analyzing debug-logs, and I have come up with a few questions:
>
> The first thing I'm doing is reading lines from an access.log with
> scan-file and parsing these into clos-objects:
>
> (defun make-access-series (path)
>   (choose-if (complement #'null)
>              (map-fn t #'access-parse
>                      (scan-file path #'read-line))))
I think you want
(defun make-access-series (path)
  (declare (optimizable-series-function))
  (choose-if (complement #'null)
             (map-fn t #'access-parse
                     (scan-file path #'read-line))))
> access-parse parses a line, does some simple filtering (to avoid
> making unnecessary objects) and returns an access-log object.
>
> To test this, I typically do something like:
>
> (subseries (make-access-series #P"log:access.log") 0 2)
>
> Giving me the first two parsed access-log objects.
>
> 1. Series are supposed to be lazy, which I thought meant that in this
> example only the first two lines would need to be read (given that
> the access-parse function accepts these), but running this code
> leads to heavy disk activity. Have I misunderstood "lazy", or is
> this implementation-dependent? (I'm using series from CLOCC on
> clisp)
Well it should read lines until two non-nil objects have been returned
by ACCESS-PARSE. I'm not sure why your hard drive would have a hard time
with that...
> 2. Once the parsing is debugged, the next step is to include this in
> higher-order functions. This means both a development phase, where
> I only need a small set of data (the same data over and over
> will do nicely), and doing the full analysis, where I'll probably
> want to run the same set of data through different aggregations.
> In both these cases the reading and parsing really only needs to be
> done once, and I wonder: how can I add caching to a given stage in
> this analysis, and upon re-read of the series, read from the cache
> instead of re-doing the read-parse? If this was a shell-script, I
> would do:
> sed parselog.sed < logfile | tee log.tmp1 | analysis-1
> and later on:
> analysis-2 < log.tmp1
> In other words: how can I make "tee" with a series?
> (I imagine the CL solution can be a lot smarter than tee, having
> implicit caching without the caller knowing, and automatically
> continuing to read the original series once past the length of the
> cache)
I'm not exactly sure what you want here. If you're talking about storing
a series for later examination, you can't do that without losing all the
optimization Series provides. You can collect it into a data structure
or file though, and then later scan that to get a series again.
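A minimal sketch of that collect-then-scan approach, assuming MAKE-ACCESS-SERIES from earlier in the thread; the *ACCESS-CACHE* variable and CACHED-ACCESS-SERIES wrapper are hypothetical names, and the SERIES package is assumed to be loaded with its symbols imported:

```lisp
;; Hypothetical caching wrapper: materialize the expensive series once,
;; then hand out a fresh series over the cached list on every call.
;; Assumes the SERIES package is loaded and its symbols imported.
(defvar *access-cache* nil)

(defun cached-access-series (path)
  (unless *access-cache*
    ;; COLLECT forces the whole series into an ordinary list...
    (setf *access-cache*
          (collect 'list (make-access-series path))))
  ;; ...and SCAN turns the cached list back into a series.
  (scan *access-cache*))
```

Note that this trades away laziness (the whole file is read and parsed on first use), which matches the caveat above about losing the optimization.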
> 3. Some log-files have entries spanning several lines. Is it possible
> to make a function collecting a number of entries from an
> input-series and having the result returned as a new series? Like
> chunk, but the number of entries in each chunk is not known in
> advance.
You can use PRODUCING to create off-line transducers. I think that's
what you want.
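A rough, untested sketch of such an off-line transducer with PRODUCING, grouping a series of lines into a series of multi-line entries. ENTRY-START-P is a hypothetical predicate (here: a new entry starts in column 0), SCAN-ENTRIES is an invented name, and the restrictions the manual places on PRODUCING bodies apply:

```lisp
;; Sketch: turn a series of lines into a series of entries, where each
;; entry is a list of lines. Assumes the SERIES package is loaded.
(defun entry-start-p (line)
  ;; Hypothetical: a line beginning in column 0 starts a new entry.
  (and (plusp (length line))
       (not (member (char line 0) '(#\Space #\Tab)))))

(defun scan-entries (lines)
  (declare (optimizable-series-function))
  (producing (entries) ((lines lines) (chunk '()) (line nil))
    (loop
      (tagbody
        (setq line (next-in lines
                            ;; Input exhausted: flush the last entry.
                            (when chunk
                              (next-out entries (nreverse chunk)))
                            (terminate-producing)))
        ;; A new entry begins: emit the accumulated one.
        (when (and chunk (entry-start-p line))
          (next-out entries (nreverse chunk))
          (setq chunk '()))
        (setq chunk (cons line chunk))))))
```

The multiple NEXT-OUT calls here may well defeat optimization, but the expression should still run under the lazy fallback implementation.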
You may want to read
AIM-1082 - Optimization of Series Expressions: Part I: User's Manual for
the Series Macro Package
ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1082.pdf
for more information on how to use Series.
Regards,
Dirk Gerrits
Dirk Gerrits <····@dirkgerrits.com> writes:
> Rolf Rander Næss wrote:
>> Hi, I have just started looking into series, making a small system for
>> analyzing debug-logs, and I have come up with a few questions:
>> The first thing I'm doing is reading lines from an access.log with
>> scan-file and parsing these into clos-objects:
>> (defun make-access-series (path)
>>   (choose-if (complement #'null)
>>              (map-fn t #'access-parse
>>                      (scan-file path #'read-line))))
>
> I think you want
>
> (defun make-access-series (path)
>   (declare (optimizable-series-function))
>   (choose-if (complement #'null)
>              (map-fn t #'access-parse
>                      (scan-file path #'read-line))))
This is the same advice Raymond Toy gave me, I think. This alone didn't
make any difference, but when I also wrapped the subseries call in
an optimizable-series-function like this:

(defun access-series-first-2 (path)
  (declare (optimizable-series-function 1))
  (subseries (make-access-series path) 0 2))

it made a huge difference.
>> 2. Once the parsing is debugged, the next step is to include this in
>> higher-order functions. This means both a development phase, where
>> I only need a small set of data (the same data over and over
>> will do nicely), and doing the full analysis, where I'll probably
>> want to run the same set of data through different aggregations.
>> In both these cases the reading and parsing really only needs to be
>> done once, and I wonder: how can I add caching to a given stage in
>> this analysis, and upon re-read of the series, read from the cache
>> instead of re-doing the read-parse? If this was a shell-script, I
>> would do:
>> sed parselog.sed < logfile | tee log.tmp1 | analysis-1
>> and later on:
>> analysis-2 < log.tmp1
>> In other words: how can I make "tee" with a series?
>> (I imagine the CL solution can be a lot smarter than tee, having
>> implicit caching without the caller knowing, and automatically
>> continuing to read the original series once past the length of the
>> cache)
>
> I'm not exactly sure what you want here. If you're talking about
> storing a series for later examination, you can't do that without
> losing all the optimization Series provides. You can collect it into a
> data-structure or file though, and then later scan that to get a
> series again.
Collecting results is what I want, but transparent to the syntax.
Something like:
(subseries
 (light-analysis-1
  (caching-series *series-cache*
                  (heavy-seriesmaking-function)))
 0 20)
which would store the first n results from the
heavy-seriesmaking-function (enough to compute subseries 0 to 20) in
the *series-cache*, such that a later call like:
(subseries
 (light-analysis-2
  (caching-series *series-cache*
                  (heavy-seriesmaking-function)))
 0 40)
would not need to re-calculate the heavy-seriesmaking-function for the
first 20 entries. (This would ofcourse mandate storing some kind of
state for the heavy-seriesmaking-function).
While developing and debugging the analysis-functions, I would like to
operate on just a subseries, and avoid re-reading data from file each
time. (In the code-test-debug-cycle, anything more than 10s is "too
long")
Anyway, reading the responses to my question, studying the series-doc
and looking at the macroexpansions of the function-definitions above,
it seems my caching-idea isn't really feasible, and it is possibly the
wrong way to approach series.
> AIM-1082 - Optimization of Series Expressions: Part I: User's Manual
> for the Series Macro Package
>
> ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-1082.pdf
Yes, thank you.
rolf rander
From: Rahul Jain
Subject: Re: using SERIES for log-parsing
Date:
Message-ID: <87pta65ipy.fsf@nyct.net>
··········@pvv.org (Rolf Rander Næss) writes:
> Anyway, reading the responses to my question, studying the series-doc
> and looking at the macroexpansions of the function-definitions above,
> it seems my caching-idea isn't really feasible, and it is possibly the
> wrong way to approach series.
Not at all. Just bind the series to a local variable and pass it along
in both cases where you need to use it. You'll get compiler warnings
when you are creating a traversal which can't be compiled to a fast
loop.
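This suggestion might be sketched as below; TWO-ANALYSES and ENTRY-BYTES are hypothetical names (an accessor on the parsed access-log objects), and the SERIES package is assumed to be loaded. Within one expression, the bound series can feed several consumers while the file is still traversed only once:

```lisp
;; Hypothetical sketch: one pass over the log feeds two aggregations.
;; ENTRY-BYTES is an assumed accessor on the access-log objects.
(defun two-analyses (path)
  (let ((entries (make-access-series path)))
    (values
     (collect-length entries)                                 ; analysis 1
     (collect-sum (map-fn 'integer #'entry-bytes entries))))) ; analysis 2
```

If the traversal cannot be compiled into a single loop, the compiler warnings mentioned above will say so, and the lazy fallback evaluation is used instead.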
--
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist