From: Kirk Job Sluder
Subject: newbie wants help:  Splitting delimited lines
Date: 
Message-ID: <87fywxfu59.fsf@debian.kirkjobsluder.is-a-geek.net>
I've been working my way through Practical Common Lisp (PCL) and would
like some advice on how to make this function better.

Currently I'm re-writing python scripts for my Discourse Analysis
research on chat and email within a system.  The log files are character
delimited, one message per line, but the last field on the line is
"dirty" with unescaped delimiters.  So a typical line might look
something like this (not an actual line):

userID^username^gender^site^messageDate^world^messagetext
2706^user^m^center^2004-03-01 09:21^chatWorld^Dirty text with ^caret.

What I want is a generalizable function that allows me to do something
like this:  

;;----
;;;the field list
(defparameter *field-list* '("userID" 
			     "username"
			     "gender"
			     "site"
			     "messageDate"
			     "world"
			     "messageText")
  "the field list for splitting and identifying fields")

(setf record (split-line #\^ line *field-list*))
(get-field "userID" record)
(get-field "gender" record)

;;---------

Because I'm dealing with different log file formats, I really want to be
able to reference field values by name rather than remember that the
message text is (nth 4 line) in one format, and (nth 6 line) in another
format. 

;;---------
;;This function splits the line into count number of fields 
;;tacking on the last field as a possibly "dirty" remainder.
(defun pythonic-split (split-char line count)
  "split the line using split-char to produce a maximum of count fields"
  (multiple-value-bind (result-list place)
      ;;subtract one from count so that you can pass the total number
      ;;of desired fields
      (split-sequence:split-sequence split-char line :count (- count 1))
  (append result-list `(,(subseq line place)))))

(defun split-line (split-char line labels)
  "split a line into an alist of length count with labels" 
  ;;a solution for matching fields to labels, use the length of the 
  ;;labels list to get the count.
  (pairlis labels (pythonic-split split-char line (length labels))))

(defun get-field (field record)
  "get a single field from the record"
  (cdr (assoc field record :test 'string=)))

;;-------
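
As a point of comparison, the same "split into at most N fields" behaviour can be sketched without the split-sequence library. This is only an illustration, not the code from the post: split-max is a hypothetical name, and it assumes the delimiter is a single character.

```lisp
;; Hypothetical helper, not from the original post: split LINE on the
;; character DELIMITER into at most COUNT fields using only standard CL.
(defun split-max (delimiter line count)
  "Return at most COUNT fields; the last one keeps any leftover delimiters."
  (loop with start = 0
        for i from 1 below count
        for pos = (position delimiter line :start start)
        while pos
        collect (subseq line start pos) into fields
        do (setf start (1+ pos))
        ;; Whatever is left after the last split becomes the final field.
        finally (return (append fields (list (subseq line start))))))

(split-max #\^ "2706^user^m^center^2004-03-01 09:21^chatWorld^Dirty text with ^caret." 7)
;; => ("2706" "user" "m" "center" "2004-03-01 09:21" "chatWorld"
;;     "Dirty text with ^caret.")
```

A line with fewer delimiters than expected simply yields fewer fields, so a caller that pairs fields with labels may want to check the length first.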

My choice of an association list here is based on the relatively naive
impression that alists are slightly better than hash tables for short and
short-lived data structures.  Any advice would be very much
appreciated.  

-- 
Kirk Job-Sluder
"The square-jawed homunculi of Tommy Hilfinger ads make every day an
existential holocaust."  --Scary Go Round

From: Pascal Bourguignon
Subject: Re: newbie wants help:  Splitting delimited lines
Date: 
Message-ID: <87ekchw8j6.fsf@thalassa.informatimago.com>
Kirk Job Sluder <····@jobsluder.net> writes:

> I've been working my way through Practical Common Lisp (PCL) and would
> like some advice on how to make this function better.
>
> Currently I'm re-writing python scripts for my Discourse Analysis
> research on chat and email within a system.  The log files are character
> delimited, one message per line, but the last field on the line is
> "dirty" with unescaped delimiters.  So a typical line might look
> something like this (not an actual line):
>
> userID^username^gender^site^messageDate^world^messagetext
> 2706^user^m^center^2004-03-01 09:21^chatWorld^Dirty text with ^caret.
>
> What I want is a generalizable function that allows me to do something
> like this:  
>
> ;;----
> ;;;the field list
> (defparameter *field-list* '("userID" 
> 			     "username"
> 			     "gender"
> 			     "site"
> 			     "messageDate"
> 			     "world"
> 			     "messageText")
>   "the field list for splitting and identifying fields")
>
> (setf record (split-line #\^ line *field-list*))
> (get-field "userID" record)
> (get-field "gender" record)
>
> ;;---------


I would use keywords instead of strings for field names, because it's
more efficient to EQ two symbols than to STRING= two strings.

(defparameter *field-list* '(:userid 
                             :username
                             :gender
                             :site
                             :messagedate
                             :world
                             :messagetext)
  "the field list for splitting and identifying fields")

(defun get-field (field record)
  "get a single field from the record"
  (cdr (assoc field record)))
;; or: :test (function eq), but the default (function eql) is good enough.

(get-field :userid record)
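
A small self-contained sketch of the difference, using made-up three-field data. Note that PAIRLIS is allowed to add the pairs in either order, which does not matter for ASSOC lookups.

```lisp
;; Illustrative data only: the field names follow the post, the values are made up.
(defparameter *demo-fields* '(:userid :username :gender))
(defparameter *demo-record*
  (pairlis *demo-fields* (list "2706" "user" "m")))

(defun get-field (field record)
  "Look up FIELD (a keyword) in the alist RECORD using the default EQL test."
  (cdr (assoc field record)))

(get-field :gender *demo-record*)   ; => "m"

;; With string keys an explicit :test is required, because two equal-looking
;; strings are generally not EQL to each other.
(cdr (assoc "gender" '(("userID" . "2706") ("gender" . "m")) :test #'string=))
;; => "m"
```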


> Because I'm dealing with different log file formats, I really want to be
> able to reference field values by name rather than remember that the
> message text is (nth 4 line) in one format, and (nth 6 line) in another
> format. 

Then I would even do without the assoc and use defstruct, which
defines accessors automatically:


(defmacro defstruct+split (name-and-options &rest fields)
  (let ((name (if (consp name-and-options)
                         (car name-and-options)
                         name-and-options))
        (split-char (or (second (find :split-char name-and-options
                                      :key (lambda (item) (if (consp item)
                                                       (car item)
                                                       item))))))
        (n&o (remove :split-char name-and-options
                     :key (lambda (item) (if (consp item)
                                                       (car item)
                                                       item)))))
  `(progn (defstruct ,n&o ,@fields)
          (defun ,(intern (format nil "PARSE-~A" name)) (line)
            (split-line ',split-char line ,(length fields))))))


(defun split-line (split-char line count)
   "split a line into a list of count elements" 
   (pythonic-split split-char line count))

(defstruct+split (log1 (:type list) (:split-char #\^))
  userid  username gender site messagedate world messagetext)

(defstruct+split (passwd (:type list)  (:split-char #\:))
  login password uid gid gecos home shell)

[251]> (log1-username l)
"Pascal Bourguignon"
[252]> (setf p (parse-passwd "pjb:x:1000:1000:Pascal Bourguignon:/home/pjb:/bin/bash"))
("pjb" "x" "1000" "1000" "Pascal Bourguignon" "/home/pjb" "/bin/bash")
[253]> (passwd-login p)
"pjb"

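The reason this works is that a DEFSTRUCT with (:type list) stores its slots in an ordinary list, so the generated accessors can be applied to whatever list the splitter returns. A minimal sketch, using a made-up three-field structure named demo-passwd rather than the macro above:

```lisp
;; demo-passwd is a hypothetical, cut-down version of the passwd structure.
(defstruct (demo-passwd (:type list))
  login password uid)

;; No :named option is given, so the list holds just the three slot values.
(defparameter *entry* (list "pjb" "x" "1000"))

(demo-passwd-login *entry*)   ; => "pjb"
(demo-passwd-uid *entry*)     ; => "1000"

;; The accessors are SETF-able places as well.
(setf (demo-passwd-uid *entry*) "1001")
*entry*                       ; => ("pjb" "x" "1001")
```
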


> ;;---------
> ;;This function splits the line into count number of fields 
> ;;tacking on the last field as a possibly "dirty" remainder.
> (defun pythonic-split (split-char line count)
>   "split the line using split-char to produce a maximum of count fields"
>   (multiple-value-bind (result-list place)
>       ;;subtract one from count so that you can pass the total number
>       ;;of desired fields
>       (split-sequence:split-sequence split-char line :count (- count 1))
>   (append result-list `(,(subseq line place)))))
>
> (defun split-line (split-char line labels)
>   "split a line into an alist of length count with labels" 
>   ;;a solution for matching fields to labels, use the length of the 
>   ;;labels list to get the count.
>   (pairlis labels (pythonic-split split-char line (length labels))))
>
> (defun get-field (field record)
>   "get a single field from the record"
>   (cdr (assoc field record :test 'string=)))
>
> ;;-------
>
> My choice of an association list here is based on the relatively naive
> impression that alists are slightly better than hash tables for short and
> short-lived data structures.  Any advice would be very much
> appreciated.  

You're right, small lists are quite efficient.


Now we have the ease of use with these structure accessors.  If you
need the speed too, you could use vectors instead of lists:


(defstruct+split (passwd (:type vector) (:split-char #\:))
  login password uid gid gecos home shell)

[261]> (setf p (parse-passwd "pjb:x:1000:1000:Pascal Bourguignon:/home/pjb:/bin/bash"))

#("pjb" "x" "1000" "1000" "Pascal Bourguignon" "/home/pjb" "/bin/bash")
[262]> (passwd-login p)

"pjb"

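Accessors generated for (:type vector) expect a vector, whereas a splitter such as pythonic-split returns a list, so a conversion step is needed somewhere. A minimal sketch of that step, with a hypothetical cut-down structure named demo-vpasswd:

```lisp
;; demo-vpasswd is a made-up name used only for this illustration.
(defstruct (demo-vpasswd (:type vector))
  login password uid)

;; COERCE copies the list of fields into a simple vector.
(defparameter *ventry* (coerce (list "pjb" "x" "1000") 'vector))

(demo-vpasswd-login *ventry*)  ; => "pjb"
(elt *ventry* 2)               ; => "1000"
```
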
-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!

From: Kirk Job Sluder
Subject: Re: newbie wants help:  Splitting delimited lines
Date: 
Message-ID: <87vf5tnquv.fsf@debian.kirkjobsluder.is-a-geek.net>
Pascal Bourguignon <···@informatimago.com> writes:

> Kirk Job Sluder <····@jobsluder.net> writes:

Ahh, thank you very very much.

> I would use keywords instead of strings for field names, because it's
> more efficient to EQ two symbols than to STRING= two strings.

I was wondering about this, but I didn't know that you could make a list of
keywords outside the context of a parameter list.

> Then I would even do without the assoc and use defstruct, which
> defines accessors automatically:

I was thinking along these lines, and your example code helps quite a
bit.  One of my concerns with this is how would I access the structure
slots dynamically?  For example, something like:

(defun do-something (fieldname)
        ...
        (do-foo (struct-$fieldname$ baz))
        ...)

I suppose what I want is a function that given "foo" will produce slot
accessor function STRUCT-FOO.

-- 
Kirk Job-Sluder
"The square-jawed homunculi of Tommy Hilfinger ads make every day an
existential holocaust."  --Scary Go Round

From: Pascal Bourguignon
Subject: Re: newbie wants help:  Splitting delimited lines
Date: 
Message-ID: <8764xtw4c1.fsf@thalassa.informatimago.com>
Kirk Job Sluder <····@jobsluder.net> writes:

> Pascal Bourguignon <···@informatimago.com> writes:
>
>> Kirk Job Sluder <····@jobsluder.net> writes:
>
> Ahh, thank you very very much, 
>
>> I would use keywords instead of strings for field names, because it's
>> more efficient to EQ two symbols than to STRING= two strings.
>
> I was wondering about this, but I didn't know that you could make a list of
> keywords outside the context of a parameter list.

Keywords are symbols like other symbols (only they're in the KEYWORD
package and automatically self evaluating).
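
A few forms that can be tried at the REPL to confirm this (the keyword names are just the ones from the field list):

```lisp
(symbolp :userid)                        ; => T
(keywordp :userid)                       ; => T
(eq :userid :userid)                     ; => T
(eq (symbol-value :userid) :userid)      ; => T, a keyword evaluates to itself
(package-name (symbol-package :userid))  ; => "KEYWORD"

;; So a list of keywords needs no special context.  LIST evaluates its
;; arguments, and each keyword evaluates to itself, giving the same
;; elements as the quoted form.
(equal (list :userid :gender) '(:userid :gender))  ; => T
```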


>> Then I would even do without the assoc and use defstruct, which
>> defines accessors automatically:
>
> I was thinking along these lines, and your example code helps quite a
> bit.  One of my concerns with this is how would I access the structure
> slots dynamically?  For example, something like:
>
> (defun do-something (fieldname)
>         ...
>         (do-foo (struct-$fieldname$ baz))
>         ...)
>
> I suppose what I want is a function that given "foo" will produce slot
> accessor function STRUCT-FOO.

Well, with normal structures, this is not possible (portably).
But here we don't have real structures, we have lists or vectors.
So you can apply the normal sequence functions on them:

(let ((record (parse-log1 "100^Pascal Bourguignon^M^La Manga del Mar Menor^2005-10-12^Mine^Tralal lalere.")))
     (dotimes (i (length record))  
       (print (elt record i))))
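
Building on the same idea, a field name can be mapped to a position at run time with POSITION, which works for both list and vector records. This is a sketch only: field-value and *demo-field-order* are hypothetical names, and the three-field record is made up.

```lisp
;; Hypothetical field order for a cut-down record.
(defparameter *demo-field-order* '(userid username gender))

(defun field-value (record field &optional (fields *demo-field-order*))
  "Return the element of RECORD at the position FIELD has in FIELDS."
  (let ((index (position field fields)))
    (if index
        (elt record index)
        (error "~A is not a known field" field))))

(field-value '("100" "Pascal Bourguignon" "M") 'username)  ; => "Pascal Bourguignon"
(field-value #("100" "Pascal Bourguignon" "M") 'gender)    ; => "M"
```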

And, you can extend the macro at will:

(defmacro defstruct+split (name-and-options &rest fields)
  (flet ((optkey (item) (if (consp item) (car item) item)))
    (let ((name (if (consp name-and-options)
                  (car name-and-options)
                  name-and-options))
          (split-char (or (second (find :split-char name-and-options
                                        :key (function optkey)))))
          (n&o (remove :split-char name-and-options
                       :key (function optkey)))
          (type (second (find :type name-and-options
                              :key (function optkey)))))
      `(progn (defstruct ,n&o ,@fields)
              (defun ,(intern (format nil "PARSE-~A" name)) (line)
                ,(case type
                   ((null)
                    (error "You must use (:type list) or (:type vector)"))
                   ((list)
                    `(split-line ',split-char line ,(length fields)))
                   ((vector)
                    `(coerce (split-line ',split-char line
                                         ,(length fields)) 'vector))
                   (otherwise
                    (error "You must use (:type list) or (:type vector)"))))
              (defun ,(intern (format nil "~A-FIELD" name)) (record field)
                 (case field
                     ,@(let ((index -1))
                         (mapcar (lambda (field)
                                   `((,field) (elt record ,(incf index))))
                                 fields))
                   (otherwise (error "~A is not a field of ~A"
                                     field ',name))))
              ,@(let ((index -1))
                  (mapcar
                   (lambda (field)
                     `(defconstant ,(intern (format nil "+~A-~A+" name field))
                        ,(incf index))) fields))
              ',name))))

[303]> (setf l (parse-log1 "100^Pascal Bourguignon^M^La Manga del Mar Menor^2005-10-12^Mine^Tralal lalere."))

("100" "Pascal Bourguignon" "M" "La Manga del Mar Menor" "2005-10-12" "Mine"
 "Tralal lalere.")
[304]> (log1-field l 'username)

"Pascal Bourguignon"
[305]> (elt l +log1-username+)
"Pascal Bourguignon"
[306]> (setf p (parse-passwd "pjb:x:1000:1000:Pascal Bourguignon:/home/pjb:/bin/bash"))

#("pjb" "x" "1000" "1000" "Pascal Bourguignon" "/home/pjb" "/bin/bash")
[307]> (list (passwd-login p) (passwd-field p 'login) (elt p +passwd-login+))
("pjb" "pjb" "pjb")
[308]> 
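
For reference, the field-lookup function and index constants that the macro generates have roughly the following shape. This is a hand-written equivalent for a hypothetical three-field structure named demo, not actual macroexpansion output.

```lisp
;; Hand-written stand-in for what the macro would generate for a
;; hypothetical structure DEMO with fields USERID, USERNAME and GENDER.
(defun demo-field (record field)
  (case field
    ((userid)   (elt record 0))
    ((username) (elt record 1))
    ((gender)   (elt record 2))
    (otherwise (error "~A is not a field of ~A" field 'demo))))

(defconstant +demo-userid+ 0)
(defconstant +demo-username+ 1)
(defconstant +demo-gender+ 2)

(demo-field '("100" "Pascal Bourguignon" "M") 'gender)   ; => "M"
(elt #("100" "Pascal Bourguignon" "M") +demo-username+)  ; => "Pascal Bourguignon"
```

Because CASE compares its keys with EQL, the field symbol passed in has to be the same symbol, in the same package, as the one used in the structure definition.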

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!

From: Kirk Job Sluder
Subject: Re: newbie wants help:  Splitting delimited lines
Date: 
Message-ID: <87ekcgz5in.fsf@debian.kirkjobsluder.is-a-geek.net>
Thanks for the suggestion.  I don't know if I need to go there yet, but
the example code is nice.  

The same script using a struct runs in one third of the time of the
version using an alist, so the change was worthwhile for building future scripts.
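
A comparison like this can be reproduced roughly with TIME. The sketch below is an assumption about how such a measurement might be set up, not the script from the thread: the record contents and function names are made up, and the actual ratio will depend on the implementation and on what else the script does.

```lisp
;; bench-rec is a hypothetical cut-down record type used only for timing.
(defstruct (bench-rec (:type list)) userid username gender)

(defparameter *bench-alist*
  (pairlis '(:userid :username :gender) (list "2706" "user" "m")))
(defparameter *bench-list* (list "2706" "user" "m"))

(defun count-gender-alist (record n)
  "Look up :gender N times through ASSOC and count the matches."
  (loop repeat n count (string= "m" (cdr (assoc :gender record)))))

(defun count-gender-struct (record n)
  "Look up the gender slot N times through the generated accessor."
  (loop repeat n count (string= "m" (bench-rec-gender record))))

;; Compare the two at the REPL, for example:
;; (time (count-gender-alist *bench-alist* 1000000))
;; (time (count-gender-struct *bench-list* 1000000))
```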

-- 
Kirk Job-Sluder
"The square-jawed homunculi of Tommy Hilfinger ads make every day an
existential holocaust."  --Scary Go Round