parsing a CSV line

From: Jock Cooper
Subject: parsing a CSV line
Date: Mon, 20 Sep 2004 19:36:32 +0000
Message-ID: <m34qls4trz.fsf@jcooper02.sagepub.com>

Parsing a CSV line is nearly trivial, but can be a pain to handle the 
quote text qualifier and backslashes.  Here is some code I wrote but
was wondering if there might be a more elegant or shorter way:

(defun get-csv-line (line)
  (when line
    (let ((char-list (coerce line 'list))
	  (current nil)
	  (list-to-build nil))
      (labels ((next-char () (pop char-list))
	       (get-event ()
		 (let ((ch (next-char)))
		   (cond ((null ch) (values :end nil))
			 ((char= ch #\") (values :is-quote #\"))
			 ((char= ch #\,) (values :is-comma #\,))
			 ((char= ch #\\) (values :is-backslash #\\))
			 (t (values :normal ch)))))
	       (field-finished ()
		 (push (coerce (nreverse current) 'string) list-to-build)))
	  (block loop
	    (tagbody
	     start
	      (multiple-value-bind (event ch) (get-event)
		(ecase event
		  (:is-backslash (go backslash-start) )
		  (:normal
		   (push ch current)
		   (go start))
		  (:is-quote
		   (go quoted))
		  (:is-comma
		   (field-finished)
		   (setq current nil)
		   (go start))
		  (:end
		   (field-finished)
		   (go exit))
		  ))
	     backslash-start
	      (multiple-value-bind (event ch) (get-event)
		(declare (ignore event))
		(push ch current)
		(go start))
	     quoted
	      (multiple-value-bind (event ch) (get-event)
		(ecase event
		  (:is-backslash (go backslash-quoted))
		  (:normal
		   (push ch current)
		   (go quoted))
		  (:is-quote
		   (go start))
		  (:is-comma
		   (push ch current)
		   (go quoted))
		  (:end
		   (field-finished)
		   (go exit))
		  ))
	     backslash-quoted
	      (multiple-value-bind (event ch) (get-event)
		(declare (ignore event))
		(push ch current)
		(go quoted))
	     exit
	      (return-from loop (nreverse list-to-build))))))))

Any comments on the above code?

Jock Cooper
--
http://www.fractal-recursions.com

Re: parsing a CSV line Rob Warnock
Re: parsing a CSV line Jeff
- Re: parsing a CSV line Jock Cooper
Re: parsing a CSV line Stefan Scholl
- Re: parsing a CSV line Pascal Bourguignon
  - Re: parsing a CSV line Jens Axel Søgaard
    - Re: parsing a CSV line Jock Cooper
      - Re: parsing a CSV line Alain Picard
        Re: parsing a CSV line Wade Humeniuk
  - Re: parsing a CSV line Jens Axel Søgaard
    - Re: parsing a CSV line Björn Lindberg
      - Re: parsing a CSV line Christophe Rhodes
      - Re: parsing a CSV line Pascal Bourguignon
      - Re: parsing a CSV line Kalle Olavi Niemitalo
Re: parsing a CSV line John Thingstad
- Re: parsing a CSV line Jock Cooper
Re: parsing a CSV line Edi Weitz

From: Rob Warnock
Subject: Re: parsing a CSV line
Date: Tue, 21 Sep 2004 02:38:08 +0000
Message-ID: <nsKdnbiNlOuNCdLcRVn-pQ@speakeasy.net>

Jock Cooper  <·····@mail.com> wrote:
+---------------
| Parsing a CSV line is nearly trivial, but can be a pain to handle the 
| quote text qualifier and backslashes.  Here is some code I wrote but
| was wondering if there might be a more elegant or shorter way:
| 
| (defun get-csv-line (line)
|   ...[61 lines omitted]...
+---------------

Here's mine, which I've been using for some time. It's about half
the length of yours (not that LOC is a good measure of anything),
uses a slightly simpler parser (though I admit I went through
*several* revisions before this version), and uses LOOP...ACROSS
to avoid having to break the string into a list of characters
up-front (though it probably still conses an equivalent amount):

;;; PARSE-CSV-LINE -- ···············@rpw3.org
;;; Parse one CSV line into a list of fields, ignoring comment
;;; lines, stripping quotes and field-internal escape characters.
;;; Lexical states: '(normal quoted escaped quoted+escaped)
;;;
(defun parse-csv-line (line)
  (when (or (string= line "")		; special-case blank lines
	    (char= #\# (char line 0)))	; or those starting with "#"
    (return-from parse-csv-line '()))
  (loop for c across line
	with state = 'normal
	and results = '()
	and chars = '() do
    (ecase state
      ((normal)
       (case c
         ((#\") (setq state 'quoted))
         ((#\\) (setq state 'escaped))
         ((#\,)
          (push (coerce (nreverse chars) 'string) results)
	  (setq chars '()))
         (t (push c chars))))
      ((quoted)
       (case c
         ((#\") (setq state 'normal))
         ((#\\) (setq state 'quoted+escaped))
         (t (push c chars))))
      ((escaped) (push c chars) (setq state 'normal))
      ((quoted+escaped) (push c chars) (setq state 'quoted)))
    finally
     (progn
       (push (coerce (nreverse chars) 'string) results)	; close open field
       (return (nreverse results)))))


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Jeff
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 22:41:26 +0000
Message-ID: <q0J3d.338478$8_6.324074@attbi_s04>

Jock Cooper wrote:

> Parsing a CSV line is nearly trivial, but can be a pain to handle the 
> quote text qualifier and backslashes.  Here is some code I wrote but
> was wondering if there might be a more elegant or shorter way:

Using the Lisp reader, this seems easy enough:

(defun parse-csv (text &optional (delim #\,) empty)
  (if (zerop (length text))
      nil
    (if (char= (char text 0) delim)
        (if empty
            (cons nil (parse-csv (subseq text 1) delim t))
          (parse-csv (subseq text 1) delim t))
      (multiple-value-bind (obj index) 
          (read-from-string text)
        (cons obj (parse-csv (subseq text index) delim))))))

You could easily wrap this around with-open-file and read-line to get
multiple lines from a file. It probably doesn't handle all
cases/problems, but from a typical, well-formatted csv file it should
do just fine.
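
A minimal sketch of that wrapper, assuming the per-line parser is passed
in as a function. The name map-csv-lines is made up for illustration and
is not part of the code above. Because parse-csv goes through the Lisp
reader, the sketch binds *read-eval* to NIL so that "#." in the input
signals an error instead of evaluating code:

```lisp
;; Hypothetical helper (not from the original post): apply PARSER to
;; every line read from STREAM and collect the results in order.
(defun map-csv-lines (parser stream)
  (let ((*read-eval* nil))              ; refuse #. while the reader is in use
    (loop for line = (read-line stream nil nil)
          while line
          collect (funcall parser line))))

;; Usage with a file (the path is a placeholder):
;; (with-open-file (in "data.csv")
;;   (map-csv-lines #'parse-csv in))
```

Taking a stream rather than a pathname keeps the helper usable with
with-input-from-string when trying it out at the REPL.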

Jeff

-- 
Progress in computer technology is waiting for people to stop
re-inventing the wheel.
 - Dr. William Bland (comp.lang.lisp)

From: Jock Cooper
Subject: Re: parsing a CSV line
Date: Tue, 21 Sep 2004 00:33:33 +0000
Message-ID: <m3vfe831gi.fsf@jcooper02.sagepub.com>

"Jeff" <···@nospam.insightbb.com> writes:

> Jock Cooper wrote:
> 
> > Parsing a CSV line is nearly trivial, but can be a pain to handle the 
> > quote text qualifier and backslashes.  Here is some code I wrote but
> > was wondering if there might be a more elegant or shorter way:
> 
> Using the Lisp reader, this seems easy enough:
> 
> (defun parse-csv (text &optional (delim #\,) empty)
>   (if (zerop (length text))
>       nil
>     (if (char= (char text 0) delim)
>         (if empty
>             (cons nil (parse-csv (subseq text 1) delim t))
>           (parse-csv (subseq text 1) delim t))
>       (multiple-value-bind (obj index) 
>           (read-from-string text)
>         (cons obj (parse-csv (subseq text index) delim))))))
> 
> You could easily wrap this around with-open-file and read-line to get
> multiple lines from a file. It probably doesn't handle all
> cases/problems, but from a typical, well-formatted csv file it should
> do just fine.
> 
> Jeff
> 

Thanks, this is the kind of thing I was looking for. I'm in the habit
of writing state machines for this type of parsing.

From: Stefan Scholl
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 21:09:21 +0000
Message-ID: <1tss2eifgqvak$.dlg@parsec.no-spoon.de>

On 2004-09-20 21:36:32, Jock Cooper wrote:
> 			 ((char= ch #\,) (values :is-comma #\,))

> 
> Any comments on the above code?

The German version of Excel uses a semicolon instead of comma.
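
One way to cope with that is to make the separator a parameter. The
following is only a sketch, assuming the same quote and backslash rules
as the code in this thread; the function name split-csv-line and the
:separator keyword are invented for illustration:

```lisp
;; Hypothetical variant of the state-machine parsers in this thread,
;; with the field separator passed as a keyword argument.
(defun split-csv-line (line &key (separator #\,))
  (let ((state :normal) (chars '()) (fields '()))
    (flet ((finish-field ()
             (push (coerce (nreverse chars) 'string) fields)
             (setf chars '())))
      (loop for c across line
            do (ecase state
                 (:normal
                  (cond ((char= c separator) (finish-field))
                        ((char= c #\") (setf state :quoted))
                        ((char= c #\\) (setf state :escaped))
                        (t (push c chars))))
                 (:quoted
                  (cond ((char= c #\") (setf state :normal))
                        ((char= c #\\) (setf state :quoted+escaped))
                        (t (push c chars))))
                 ;; A backslash makes the next character literal.
                 (:escaped (push c chars) (setf state :normal))
                 (:quoted+escaped (push c chars) (setf state :quoted))))
      (finish-field)                    ; close the last open field
      (nreverse fields))))

;; (split-csv-line "1,5;\"a;b\";c" :separator #\;)
;; => ("1,5" "a;b" "c")
```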

From: Pascal Bourguignon
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 21:48:14 +0000
Message-ID: <87mzzkipcx.fsf@thalassa.informatimago.com>

Stefan Scholl <······@no-spoon.de> writes:

> On 2004-09-20 21:36:32, Jock Cooper wrote:
> > 			 ((char= ch #\,) (values :is-comma #\,))
> 
> > 
> > Any comments on the above code?
> 
> The German version of Excel uses a semicolon instead of comma.

Ouch!

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.

From: Jens Axel Søgaard
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 22:06:56 +0000
Message-ID: <414f547c$0$275$edfadb0f@dread11.news.tele.dk>

Pascal Bourguignon wrote:
> Stefan Scholl <······@no-spoon.de> writes:
>>On 2004-09-20 21:36:32, Jock Cooper wrote:
>>
>>>			 ((char= ch #\,) (values :is-comma #\,))
>>
>>>Any comments on the above code?
>>The German version of Excel uses a semicolon instead of comma.
> Ouch!

If only that was the only problem with csv files. See Neil van
Dyke's csv library and its list of references to get an
impression of the many ways a csv file can be formatted.

<http://planet.plt-scheme.org/docs/neil/csv.plt/1/0/doc.txt>

-- 
Jens Axel Søgaard

From: Jock Cooper
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 23:00:18 +0000
Message-ID: <m3zn3k35rx.fsf@jcooper02.sagepub.com>

Jens Axel Søgaard <······@soegaard.net> writes:

> Pascal Bourguignon wrote:
> > Stefan Scholl <······@no-spoon.de> writes:
> >>On 2004-09-20 21:36:32, Jock Cooper wrote:
> >>
> >>>			 ((char= ch #\,) (values :is-comma #\,))
> >>
> >>>Any comments on the above code?
> >>The German version of Excel uses a semicolon instead of comma.
> > Ouch!
> 
> If only that was the only problem with csv files. See Neil van
> Dyke's csv library and its list of references to get an
> impression of the many ways a csv file can be formatted.
> 
> <http://planet.plt-scheme.org/docs/neil/csv.plt/1/0/doc.txt>
> 
> -- 

Several of his options are trivial to add, but quote-doubling-escapes
and newline handling would expand the state machine a fair amount.
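
For the quote-doubling case alone, one extra state seems to be enough.
This is a rough sketch under the assumption that backslash escapes are
dropped and only "" inside a quoted field stands for a literal quote;
split-csv-line-doubled and the state names are made up for illustration:

```lisp
;; Hypothetical sketch: doubled quotes ("") inside a quoted field are
;; read as one literal quote.  Backslash escapes are not handled here.
(defun split-csv-line-doubled (line &key (separator #\,))
  (let ((state :normal) (chars '()) (fields '()))
    (flet ((finish-field ()
             (push (coerce (nreverse chars) 'string) fields)
             (setf chars '())))
      (loop for c across line
            do (ecase state
                 (:normal
                  (cond ((char= c separator) (finish-field))
                        ((char= c #\") (setf state :quoted))
                        (t (push c chars))))
                 (:quoted
                  (if (char= c #\")
                      (setf state :quote-seen)
                      (push c chars)))
                 ;; Just saw a quote inside a quoted field: a second
                 ;; quote is a literal quote, anything else means the
                 ;; quoted part has ended.
                 (:quote-seen
                  (cond ((char= c #\") (push c chars) (setf state :quoted))
                        ((char= c separator) (finish-field) (setf state :normal))
                        (t (push c chars) (setf state :normal))))))
      (finish-field)
      (nreverse fields))))

;; (split-csv-line-doubled "a,\"say \"\"hi\"\"\",b")
;; => ("a" "say \"hi\"" "b")
```

Newlines inside quoted fields are a different matter, since the parser
would then have to pull in further lines instead of working on one
string.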

From: Alain Picard
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 00:34:04 +0000
Message-ID: <87wtynta4j.fsf@memetrics.com>

Jock Cooper <·····@mail.com> writes:

> Several of his options are trivial to add, but quote-doubling-escapes
> and newline handling would expand the state machine a fair amount.

I wrote something which handles a lot of the "extras", it's at

http://members.optusnet.com.au/apicard/csv-parser.lisp

-- 
It would be difficult to construe        Larry Wall, in  article
this as a feature.			 <·····················@netlabs.com>

From: Wade Humeniuk
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 15:05:38 +0000
Message-ID: <6xg4d.73688$KU5.54157@edtnps89>

Alain Picard wrote:

> I wrote something which handles a lot of the "extras", it's at
> 
> http://members.optusnet.com.au/apicard/csv-parser.lisp
> 

Thank you Alain,

I just happened to need it.  I've recently (in the last
2 weeks) written a web site for my kid's school council collecting
volunteer information.
I am on the executive so I got to decide how it was to be done.
The volunteer coordinators wanted the information
available in Excel.  So I now export the information in CSV form.
Your program worked just fine, they have started to use the
results this morning.

FWIW the web site is running on
FreeBSD+CMUCL+Portableaserve+CL-SMTP and now +csv-parser.
There is no separate database component.  All data is stored in
files as readable Lisp data.  I would give the site url but
the site needs authentication to prevent non-school parents
from accessing the site.

The only difficulty is that some of my data is contained
in nested lists.  write-csv-line only takes lists containing
strings, numbers and symbols.  I understand the reasoning,
so as a work around I had to write the elements that were lists
into strings before calling write-csv-line.

Thanks again, Alain.

Wade

From: Jens Axel Søgaard
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 14:20:50 +0000
Message-ID: <41518a3d$0$242$edfadb0f@dread11.news.tele.dk>

Ingvar wrote:
> Pascal Bourguignon <····@mouse-potato.com> writes:
>>Stefan Scholl <······@no-spoon.de> writes:

>>>  The German version of Excel uses a semicolon instead of comma.
>> Ouch!
> It's probably necessary from representational choices.

> I'd be surprised if the Swedish version uses "," as a field separator,
> as that is (normally) used to separate the integer and the fractional
> part in decimal representation (normal Swedish for 1/4 is "0,25" as a
> decimal fraction).

I experienced a buggy program once. When the program saved data it
used the locale (so floating point numbers were written 3,1415),
but the loader didn't use the locale and thus expected 3.1415.
To successfully save data, one had to change the locale before
running the program.

-- 
Jens Axel Søgaard

From: Björn Lindberg
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 14:40:26 +0000
Message-ID: <hcsmzzibc4l.fsf@my.nada.kth.se>

Jens Axel Søgaard <······@soegaard.net> writes:

> Ingvar wrote:
> > Pascal Bourguignon <····@mouse-potato.com> writes:
> >>Stefan Scholl <······@no-spoon.de> writes:
> 
> >>>  The German version of Excel uses a semicolon instead of comma.
> >> Ouch!
> > It's probably necessary from representational choices.
> 
> > I'd be surprised if the Swedish version uses "," as a field separator,
> > as that is (normally) used to separate the integer and the fractional
> > part in decimal representation (normal Swedish for 1/4 is "0,25" as a
> > decimal fraction).
> 
> I experienced a buggy program once. When the program saved data it
> used the locale (so floating point numbers were written 3,1415),
> but the loader didn't use the locale and thus expected 3.1415.
> To successfully save data, one had to change the locale before
> running the program.

Speaking of which, would a CL which took locale into account to print
and read numbers using comma as the decimal separator be conforming to
ANSI?


Björn

From: Christophe Rhodes
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 21:49:41 +0000
Message-ID: <sqd60easae.fsf@cam.ac.uk>

·······@nada.kth.se (Björn Lindberg) writes:

> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?

If it required user intervention (a call to a non-standardized
operator, for instance) to put it into locale-specific mode, probably.
If not, probably not.  This should probably be viewed as analogous to
a C program being required to call setlocale(LC_ALL, "") to observe
locale settings.

Christophe

From: Pascal Bourguignon
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 21:58:13 +0000
Message-ID: <876566ezka.fsf@thalassa.informatimago.com>

·······@nada.kth.se (Björn Lindberg) writes:

> Jens Axel Søgaard <······@soegaard.net> writes:
> 
> > Ingvar wrote:
> > > Pascal Bourguignon <····@mouse-potato.com> writes:
> > >>Stefan Scholl <······@no-spoon.de> writes:
> > 
> > >>>  The German version of Excel uses a semicolon instead of comma.
> > >> Ouch!
> > > It's probably necessary from representational choices.
> > 
> > > I'd be surprised if the Swedish version uses "," as a field separator,
> > > as that is (normally) used to separate the integer and the fractional
> > > part in decimal representation (normal Swedish for 1/4 is "0,25" as a
> > > decimal fraction).
> > 
> > I experienced a buggy program once. When the program saved data it
> > used the locale (so floating point numbers were written 3,1415),
> > but the loader didn't use the locale and thus expected 3.1415.
> > To successfully save data, one had to change the locale before
> > running the program.
> 
> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?

No. 

The formatter is specified to use American English to write numbers with ~R.
There was a bug in clisp for it used British English...

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.

From: Kalle Olavi Niemitalo
Subject: Re: parsing a CSV line
Date: Wed, 22 Sep 2004 21:57:06 +0000
Message-ID: <874qlqt1al.fsf@Astalo.kon.iki.fi>

·······@nada.kth.se (Björn Lindberg) writes:

> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?

Not in the default mode: a token that contains a comma cannot be
a potential number (2.3.1.1) and must thus be interpreted as a
symbol (2.3.4).  However, I think locale-specific parsing could
be enabled with an implementation-dependent variable.
WITH-STANDARD-IO-SYNTAX would then have to bind this variable.
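
In the meantime, a program that has to accept decimal commas can parse
them itself rather than going through the reader. A small sketch,
assuming non-negative input with at most one comma and no exponent; the
name parse-decimal-comma is invented here, and the result is an exact
rational rather than a float:

```lisp
;; Hypothetical helper: parse "0,25"-style text into a rational.
;; Assumes a non-negative number with at most one comma.
(defun parse-decimal-comma (string)
  (let* ((comma (position #\, string))
         (int-part (parse-integer string :end comma))
         (frac-digits (if comma (subseq string (1+ comma)) "")))
    (if (zerop (length frac-digits))
        int-part
        (+ int-part (/ (parse-integer frac-digits)
                       (expt 10 (length frac-digits)))))))

;; (parse-decimal-comma "0,25") => 1/4
```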

From: John Thingstad
Subject: Re: parsing a CSV line
Date: Mon, 20 Sep 2004 23:21:15 +0000
Message-ID: <opsene5pv7pqzri1@mjolner.upc.no>

On 20 Sep 2004 12:36:32 -0700, Jock Cooper <·····@mail.com> wrote:

> Parsing a CSV line is nearly trivial, but can be a pain to handle the
> quote text qualifier and backslashes.  Here is some code I wrote but
> was wondering if there might be a more elegant or shorter way:
>
>
> Any comments on the above code?
>
> Jock Cooper
> --
> http://www.fractal-recursions.com

UGH! That is spaghetti code. It also duplicates operations.
How about something more like this?
(untested)

;;;; GRAMMAR
;;;;
;;;; end   -> nil
;;;; quote -> \" (bs | [^\"])* \"
;;;; bs    -> \\ .
;;;; field -> (quote | bs | [^\,])*
;;;; line  ->  field (\,  field)* end
;;;;
;;;; -> means expands to
;;;; \ means take literally
;;;; . means any char
;;;; [^ ... ] means not chars in
;;;; ( ... | ... ) means or
;;;; ( ... )* means repeat 0 or more times


(defconstant +separator+ #\,)
(defconstant +quote+ #\")
(defconstant +literal+ #\\)

(let (list field result)
   (defun next () (pop list))

   ;; bs -> \\ .   Take the character after the backslash literally.
   (defun parse-bs ()
     (let ((ch (next)))
       (when ch (push ch field))))

   ;; quote: the opening quote has already been consumed.
   (defun parse-quote ()
     (loop for ch = (next)
           do (cond ((null ch) (error "Parse error: unterminated quote"))
                    ((char= ch +quote+) (return))
                    ((char= ch +literal+) (parse-bs))
                    (t (push ch field)))))

   ;; field: collect characters up to a separator or the end of the line.
   ;; Returns the separator if more fields follow, NIL at the end.
   (defun parse-field ()
     (setf field nil)
     (loop for ch = (next)
           while (and ch (char/= ch +separator+))
           do (cond ((char= ch +quote+) (parse-quote))
                    ((char= ch +literal+) (parse-bs))
                    (t (push ch field)))
           finally (push (coerce (nreverse field) 'string) result)
                   (return ch)))

   ;; line -> field (\, field)* end
   (defun parse-csv (line)
     (setf list (coerce line 'list)
           result nil)
     (loop while (parse-field))
     (nreverse result)))



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

From: Jock Cooper
Subject: Re: parsing a CSV line
Date: Tue, 21 Sep 2004 00:44:37 +0000
Message-ID: <m3r7ow30y2.fsf@jcooper02.sagepub.com>

"John Thingstad" <··············@chello.no> writes:

> On 20 Sep 2004 12:36:32 -0700, Jock Cooper <·····@mail.com> wrote:
> 
> > Parsing a CSV line is nearly trivial, but can be a pain to handle the
> > quote text qualifier and backslashes.  Here is some code I wrote but
> > was wondering if there might be a more elegant or shorter way:
> >
> >
> > Any comments on the above code?
> >
> > Jock Cooper
> > --
> > http://www.fractal-recursions.com
> 
> UGH! That is spaghetti code. It also duplicates operations.

Well it may be spaghetti code of sorts, but it is a state machine which
I consider to be a legitimate and useful programming technique.

From: Edi Weitz
Subject: Re: parsing a CSV line
Date: Tue, 21 Sep 2004 06:15:00 +0000
Message-ID: <87d60g5esb.fsf@miles.agharta.de>

On 20 Sep 2004 12:36:32 -0700, Jock Cooper <·····@mail.com> wrote:

> Parsing a CSV line is nearly trivial, but can be a pain to handle
> the quote text qualifier and backslashes.

There's also Alain Picard's code at

  <http://members.optusnet.com.au/apicard/csv-parser.lisp>

Edi.

-- 

"Lisp doesn't look any deader than usual to me."
(David Thornley, reply to a question older than most languages)

Real email: (replace (subseq ·········@agharta.de" 5) "edi")