Jock Cooper <·····@mail.com> wrote:
+---------------
| Parsing a CSV line is nearly trivial, but can be a pain to handle the
| quote text qualifier and backslashes. Here is some code I wrote but
| was wondering if there might be a more elegant or shorter way:
|
| (defun get-csv-line (line)
| ...[61 lines omitted...)
+---------------
Here's mine, which I've been using for some time. It's about half
the length of yours (not that LOC is a good measure of anything),
uses a slightly simpler parser (though I admit I went through
*several* revisions before this version), and uses LOOP...ACROSS
to avoid having to break the string into a list of characters
up-front (though it probably still conses an equivalent amount):
;;; PARSE-CSV-LINE -- ···············@rpw3.org
;;; Parse one CSV line into a list of fields, ignoring comment
;;; lines, stripping quotes and field-internal escape characters.
;;; Lexical states: '(normal quoted escaped quoted+escaped)
;;;
(defun parse-csv-line (line)
  (when (or (string= line "")           ; special-case blank lines
            (char= #\# (char line 0)))  ; or those starting with "#"
    (return-from parse-csv-line '()))
  (loop for c across line
        with state = 'normal
        and results = '()
        and chars = '() do
    (ecase state
      ((normal)
       (case c
         ((#\") (setq state 'quoted))
         ((#\\) (setq state 'escaped))
         ((#\,)
          (push (coerce (nreverse chars) 'string) results)
          (setq chars '()))
         (t (push c chars))))
      ((quoted)
       (case c
         ((#\") (setq state 'normal))
         ((#\\) (setq state 'quoted+escaped))
         (t (push c chars))))
      ((escaped) (push c chars) (setq state 'normal))
      ((quoted+escaped) (push c chars) (setq state 'quoted)))
    finally
      (progn
        (push (coerce (nreverse chars) 'string) results) ; close open field
        (return (nreverse results)))))
-Rob
-----
Rob Warnock <····@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
Jock Cooper wrote:
> Parsing a CSV line is nearly trivial, but can be a pain to handle the
> quote text qualifier and backslashes. Here is some code I wrote but
> was wondering if there might be a more elegant or shorter way:
Using the Lisp reader, this seems easy enough:
(defun parse-csv (text &optional (delim #\,) empty)
  (if (zerop (length text))
      nil
      (if (char= (char text 0) delim)
          (if empty
              (cons nil (parse-csv (subseq text 1) delim t))
              (parse-csv (subseq text 1) delim t))
          (multiple-value-bind (obj index)
              (read-from-string text)
            (cons obj (parse-csv (subseq text index) delim))))))
You could easily wrap this around with-open-file and read-line to get
multiple lines from a file. It probably doesn't handle all
cases/problems, but from a typical, well-formatted csv file it should
do just fine.
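A rough sketch of that WITH-OPEN-FILE and READ-LINE wrapper might look
like the following. MAP-CSV-FILE and its PARSER argument are names made
up for this illustration; any one-line parser can be passed in. Binding
*READ-EVAL* to NIL is an added precaution, on the assumption that the
parser goes through the Lisp reader and the data is not fully trusted.

```lisp
;; Sketch only: MAP-CSV-FILE and its PARSER argument are hypothetical names.
(defun map-csv-file (parser pathname)
  "Call PARSER on each line of the file PATHNAME and collect the results."
  (with-open-file (in pathname :direction :input)
    ;; With *READ-EVAL* bound to NIL, a reader-based PARSER signals an
    ;; error on #. forms in the data instead of evaluating them.
    (let ((*read-eval* nil))
      (loop for line = (read-line in nil nil)
            while line
            collect (funcall parser line)))))
```

Something like (map-csv-file #'parse-csv "data.csv") would then return
one list of fields per line.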
Jeff
--
Progress in computer technology is waiting for people to stop
re-inventing the wheel.
- Dr. William Bland (comp.lang.lisp)
"Jeff" <···@nospam.insightbb.com> writes:
> Using the Lisp reader, this seems easy enough:
Thanks, this is the kind of thing I was looking for. I'm in the habit
of writing state machines for this type of parsing.
On 2004-09-20 21:36:32, Jock Cooper wrote:
> ((char= ch #\,) (values :is-comma #\,))
>
> Any comments on the above code?
The German version of Excel uses a semicolon instead of comma.
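One way to cope with that is to make the separator a parameter rather
than hard-coding the comma. A minimal sketch, with quote and backslash
handling left out to keep it short (SPLIT-CSV-LINE and its SEPARATOR
keyword are hypothetical names, not taken from anyone's code in this
thread):

```lisp
;; Sketch only: SPLIT-CSV-LINE and its SEPARATOR keyword are hypothetical.
;; Quote and backslash handling are deliberately omitted.
(defun split-csv-line (line &key (separator #\,))
  "Split LINE at each SEPARATOR character and return a list of strings."
  (loop with start = 0
        for end = (position separator line :start start)
        collect (subseq line start end)
        while end
        do (setf start (1+ end))))
```

A file exported with semicolons could then be handled with
(split-csv-line line :separator #\;), which also leaves decimal commas
inside the fields alone.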
Stefan Scholl <······@no-spoon.de> writes:
> On 2004-09-20 21:36:32, Jock Cooper wrote:
> > ((char= ch #\,) (values :is-comma #\,))
>
> >
> > Any comments on the above code?
>
> The German version of Excel uses a semicolon instead of comma.
Ouch!
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Pascal Bourguignon wrote:
> Stefan Scholl <······@no-spoon.de> writes:
>>On 2004-09-20 21:36:32, Jock Cooper wrote:
>>
>>> ((char= ch #\,) (values :is-comma #\,))
>>
>>>Any comments on the above code?
>>The German version of Excel uses a semicolon instead of comma.
> Ouch!
If only that were the only problem with csv files. See Neil van
Dyke's csv library and its list of references to get an
impression of the many ways a csv file can be formatted.
<http://planet.plt-scheme.org/docs/neil/csv.plt/1/0/doc.txt>
--
Jens Axel Søgaard
Jens Axel Søgaard <······@soegaard.net> writes:
> If only that were the only problem with csv files. See Neil van
> Dyke's csv library and its list of references to get an
> impression of the many ways a csv file can be formatted.
>
> <http://planet.plt-scheme.org/docs/neil/csv.plt/1/0/doc.txt>
Several of his options are trivial to add, but quote-doubling-escapes
and newline handling would expand the state machine a fair amount.
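To give an idea of the extra state involved, here is a sketch of a
variant that treats a doubled quote inside a quoted field as one literal
quote. It is an illustration only: PARSE-CSV-LINE/DOUBLED is a
hypothetical name, backslash is not treated specially, and embedded
newlines are not handled at all.

```lisp
;; Sketch only: PARSE-CSV-LINE/DOUBLED is a hypothetical name. Inside a
;; quoted field, "" stands for one literal quote; backslash is not special.
(defun parse-csv-line/doubled (line)
  (let ((state 'normal) (chars '()) (results '()))
    (flet ((close-field ()
             (push (coerce (nreverse chars) 'string) results)
             (setq chars '())))
      (loop for c across line do
        (ecase state
          ((normal)
           (case c
             ((#\") (setq state 'quoted))
             ((#\,) (close-field))
             (t (push c chars))))
          ((quoted)
           (case c
             ((#\") (setq state 'quote-seen))
             (t (push c chars))))
          ;; A quote was just seen inside a quoted field: a second quote
          ;; is a literal quote, anything else means the quoting ended.
          ((quote-seen)
           (case c
             ((#\") (push c chars) (setq state 'quoted))
             ((#\,) (close-field) (setq state 'normal))
             (t (push c chars) (setq state 'normal))))))
      (close-field)
      (nreverse results))))
```

The one extra state (QUOTE-SEEN) is needed because a quote inside a
quoted field cannot be classified until the next character is known.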
Jock Cooper <·····@mail.com> writes:
> Several of his options are trivial to add, but quote-doubling-escapes
> and newline handling would expand the state machine a fair amount.
I wrote something which handles a lot of the "extras", it's at
http://members.optusnet.com.au/apicard/csv-parser.lisp
--
It would be difficult to construe Larry Wall, in article
this as a feature. <·····················@netlabs.com>
Alain Picard wrote:
> I wrote something which handles a lot of the "extras", it's at
>
> http://members.optusnet.com.au/apicard/csv-parser.lisp
>
Thank you Alain,
I just happened to need it. I've recently (in the last
2 weeks) written a web site for my kid's school council collecting
volunteer information.
I am on the executive so I got to decide how it was to be done.
The volunteer coordinators wanted the information
available in Excel. So I now export the information in CSV form.
Your program worked just fine, they have started to use the
results this morning.
FWIW the web site is running on
FreeBSD+CMUCL+Portableaserve+CL-SMTP and now +csv-parser.
There is no separate database component. All data is stored in
files as readable Lisp data. I would give the site URL, but
the site needs authentication to prevent non-school parents
from accessing the site.
The only difficulty is that some of my data is contained
in nested lists. write-csv-line only takes lists containing
strings, numbers and symbols. I understand the reasoning,
so as a workaround I had to write the elements that were lists
into strings before calling write-csv-line.
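Roughly, the workaround amounts to something like the sketch below.
FLATTEN-FOR-CSV is a hypothetical helper written for this illustration
and is not part of csv-parser.lisp; it only prepares a row, and the
result would then be handed to write-csv-line.

```lisp
;; Sketch only: FLATTEN-FOR-CSV is a hypothetical helper, not part of
;; csv-parser.lisp.
(defun flatten-for-csv (row)
  "Return a copy of ROW in which each non-empty list element has been
replaced by its printed representation as a string."
  (let ((*print-pretty* nil))           ; keep each element on one line
    (mapcar (lambda (item)
              (if (consp item)
                  (prin1-to-string item)
                  item))
            row)))
```

Strings, numbers and symbols pass through unchanged, so only the nested
parts of a row are turned into text.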
Thanks again, Alain.
Wade
Ingvar wrote:
> Pascal Bourguignon <····@mouse-potato.com> writes:
>>Stefan Scholl <······@no-spoon.de> writes:
>>> The German version of Excel uses a semicolon instead of comma.
>> Ouch!
> It's probably necessary from representational choices.
> I'd be surprised if the Swedish version uses "," as a field separator,
> as that is (normally) used to separate the integer and the fractional
> part in decimal representation (normal Swedish for 1/4 is "0,25" as a
> decimal fraction).
I experienced a buggy program once. When the program saved data it
used the locale (so floating-point numbers were written 3,1415),
but the loader didn't use the locale and thus expected 3.1415.
To successfully save data, one had to change the locale before
running the program.
--
Jens Axel Søgaard
Jens Axel Søgaard <······@soegaard.net> writes:
> I experienced a buggy program once. When the program saved data it
> used the locale (so floating-point numbers were written 3,1415),
> but the loader didn't use the locale and thus expected 3.1415.
> To successfully save data, one had to change the locale before
> running the program.
Speaking of which, would a CL which took locale into account to print
and read numbers using comma as the decimal separator be conforming to
ANSI?
Björn
·······@nada.kth.se (Björn Lindberg) writes:
> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?
If it required user intervention (a call to a non-standardized
operator, for instance) to put it into locale-specific mode, probably.
If not, probably not. This should probably be viewed as analogous to
a C program being required to call setlocale(LC_ALL, "") to observe
locale settings.
Christophe
·······@nada.kth.se (Björn Lindberg) writes:
> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?
No.
The formatter is specified to use American English to write numbers with ~R.
There was a bug in CLISP, which used British English...
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
·······@nada.kth.se (Björn Lindberg) writes:
> Speaking of which, would a CL which took locale into account to print
> and read numbers using comma as the decimal separator be conforming to
> ANSI?
Not in the default mode: a token that contains a comma cannot be
a potential number (2.3.1.1) and must thus be interpreted as a
symbol (2.3.4). However, I think locale-specific parsing could
be enabled with an implementation-dependent variable.
WITH-STANDARD-IO-SYNTAX would then have to bind this variable.
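In the meantime, portable code can do the conversion explicitly instead
of relying on the reader's mode. Note that in the standard readtable the
comma is a terminating macro character, so reading "3,1415" as it stands
yields the integer 3 and stops there. A minimal sketch, assuming the
input holds a single number with a decimal comma (PARSE-DECIMAL-COMMA is
a hypothetical name):

```lisp
;; Sketch only: PARSE-DECIMAL-COMMA is a hypothetical helper.
(defun parse-decimal-comma (string)
  "Read STRING as a number, treating a comma as the decimal separator."
  (with-standard-io-syntax
    ;; WITH-STANDARD-IO-SYNTAX binds *READ-EVAL* to T, so turn it off
    ;; again: data should never be able to trigger #. evaluation.
    (let ((*read-eval* nil))
      (let ((value (read-from-string (substitute #\. #\, string))))
        (if (realp value)
            value
            (error "Not a number: ~S" string))))))
```

With this, (parse-decimal-comma "0,25") gives the same float as reading
"0.25", while input such as "1,5,2" is rejected because it no longer
reads as a number.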
On 20 Sep 2004 12:36:32 -0700, Jock Cooper <·····@mail.com> wrote:
> Parsing a CSV line is nearly trivial, but can be a pain to handle the
> quote text qualifier and backslashes. Here is some code I wrote but
> was wondering if there might be a more elegant or shorter way:
>
>
> Any comments on the above code?
>
> Jock Cooper
> --
> http://www.fractal-recursions.com
UGH! That is spaghetti code. It also duplicates operations.
How about something more like this?
(untested)
;;;; GRAMMAR
;;;;
;;;; end -> nil
;;;; quote -> \" (bs | [^\"])* \"
;;;; bs -> \\ .
;;;; field -> (quote | bs | [^\,])*
;;;; line -> field (\, field)* end
;;;;
;;;; -> means expands to
;;;; \ means take literally
;;;; . means any char
;;;; [^ ... ] means any char not in the set
;;;; ( ... | ... ) means or
;;;; ( ... )* means repeat 0 or more times
(defconstant +separator+ #\,)
(defconstant +quote+ #\")
(defconstant +literal+ #\\)

(let (chars field result)
  (defun next () (pop chars))
  ;; bs: the character after a backslash is taken literally.
  (defun parse-bs ()
    (let ((ch (next)))
      (when ch (push ch field))))
  ;; quote: called after the opening quote has been consumed.
  (defun parse-quote ()
    (loop for ch = (next)
          do (cond ((null ch) (error "Parse error: unterminated quote"))
                   ((char= ch +quote+) (return))
                   ((char= ch +literal+) (parse-bs))
                   (t (push ch field)))))
  ;; field: collect characters up to a separator or the end of the line.
  ;; Returns the terminating character, or NIL at the end of the line.
  (defun parse-field ()
    (setf field '())
    (loop for ch = (next)
          do (cond ((or (null ch) (char= ch +separator+))
                    (push (coerce (nreverse field) 'string) result)
                    (return ch))
                   ((char= ch +quote+) (parse-quote))
                   ((char= ch +literal+) (parse-bs))
                   (t (push ch field)))))
  ;; line: one field, then further fields while a separator was seen.
  (defun parse-csv (line)
    (setf chars (coerce line 'list)
          result '())
    (loop while (parse-field))
    (nreverse result)))
--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
"John Thingstad" <··············@chello.no> writes:
> UGH! That is spaghetti code. It also duplicates operations.
Well, it may be spaghetti code of sorts, but it is a state machine, which
I consider to be a legitimate and useful programming technique.
On 20 Sep 2004 12:36:32 -0700, Jock Cooper <·····@mail.com> wrote:
> Parsing a CSV line is nearly trivial, but can be a pain to handle
> the quote text qualifier and backslashes.
There's also Alain Picard's code at
<http://members.optusnet.com.au/apicard/csv-parser.lisp>
Edi.
--
"Lisp doesn't look any deader than usual to me."
(David Thornley, reply to a question older than most languages)
Real email: (replace (subseq ·········@agharta.de" 5) "edi")