newbie: text parsing Q

From: Neil Conway
Subject: newbie: text parsing Q
Date: Thu, 26 Jul 2001 22:23:04 +0000
Message-ID: <3B60984D.4000109@klamath.dyndns.org>

Hi everyone,

I'm still very new to LISP. I've created this structure:

(defstruct line
     (date 0)
     (name "")
     (hits 0))

Now, I'm reading data from a file: the data is comma-delimited and each 
line should be used to create a separate structure. So a typical line 
would be like so:

100,abc,200

What is the "proper" way to process this line and use that data to create 
the structure shown above?

Any help would be very gratefully appreciated.

TIA,

Neil

From: Kent M Pitman
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 00:46:04 +0000
Message-ID: <sfwpuanb3n7.fsf@world.std.com>

Neil Conway <····@klamath.dyndns.org> writes:

> Hi everyone,
> 
> I'm still very new to LISP. I've created this structure:
> 
> (defstruct line
>      (date 0)
>      (name "")
>      (hits 0))
> 
> Now, I'm reading data from a file: the data is comma-delimited and each 
> line should be used to create a separate structure. So a typical line 
> would be like so:
> 
> 100,abc,200
> 
> How is the "proper" way to process this line and use that data to create 
>   the structure shown above?

The community has been discussing creating a standard utility for partitioning
a string.  It's easy to write, and everyone does.

See these operations in CLHS:

 WITH-OPEN-FILE - Executes a body of code with a stream open to a given file.
 READ-LINE - reads a line from a stream
 POSITION - searches for a character (e.g., comma) in a string (or substring)
            and tells you its position
 SUBSEQ - allows you to extract a subsequence (e.g., a substring) from a 
          sequence (e.g., a string)
 PARSE-INTEGER - allows you to convert a string to an integer

> Any help would be very gratefully appreciated.

Good luck.

Btw, if this is homework, you should have identified it as such.
People here don't like accidentally doing someone's homework.

Also, if you don't know what CLHS is, it's the Common Lisp HyperSpec,
the language definition.  See 
 http://www.xanalys.com/software_tools/reference/HyperSpec/FrontMatter/
The functions mentioned above can be found through the "Symbol Index"
in CLHS.
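
As a rough illustration of how the operators listed above fit together,
here is a minimal sketch. It assumes every line has exactly three
fields laid out as integer,string,integer, as in the example line, and
it does no error handling. PARSE-LINE and READ-LINES-FROM-FILE are
made-up names for this sketch, not standard functions.

```lisp
;; The structure from the original question.
(defstruct line
  (date 0)
  (name "")
  (hits 0))

(defun parse-line (text)
  "Turn a string such as \"100,abc,200\" into a LINE structure.
Assumes exactly two commas; a malformed line signals an error."
  (let* ((comma1 (position #\, text))
         (comma2 (position #\, text :start (1+ comma1))))
    (make-line :date (parse-integer text :end comma1)
               :name (subseq text (1+ comma1) comma2)
               :hits (parse-integer text :start (1+ comma2)))))

(defun read-lines-from-file (pathname)
  "Return a list of LINE structures, one per line of the file at PATHNAME."
  (with-open-file (stream pathname)
    ;; READ-LINE returns NIL at end of file because of the NIL NIL arguments.
    (loop for text = (read-line stream nil nil)
          while text
          collect (parse-line text))))
```

PARSE-INTEGER accepts :START and :END itself, so the two numeric fields
do not need a separate SUBSEQ.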

From: ········@hex.net
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 01:31:58 +0000
Message-ID: <iw387.26655$PA1.3006848@news20.bellglobal.com>

Kent M Pitman <······@world.std.com> writes:
> Neil Conway <····@klamath.dyndns.org> writes:
> 
>> Hi everyone,
>> 
>> I'm still very new to LISP. I've created this structure:
>> 
>> (defstruct line
>>      (date 0)
>>      (name "")
>>      (hits 0))
>> 
>> Now, I'm reading data from a file: the data is comma-delimited and each 
>> line should be used to create a separate structure. So a typical line 
>> would be like so:
>> 
>> 100,abc,200
>> 
>> How is the "proper" way to process this line and use that data to create 
>>   the structure shown above?

> The community has been discussing creating a standard utility for
> partitioning a string.  It's easy to write, and everyone does.

What's _arguably_ irritating is that if it were so "standard," it
would be sort of nice to have some de-facto "standard" function around
to do this sort of thing, rather than always resorting to "It's easy
to write."

I'd take the contrary position that there _isn't_ a "standard" way of
doing this, because what gets done to the resulting fields _isn't_
uniform.

For instance, consider the example of:
   100,abc,200

It's obvious that there ought to be three "fields/slots/...", but it's
NOT obvious what exactly should be done with the contents.

Some questions to elicit the nature of the problem:

-> Is "100" a number?  
-> Of what type?  
-> Should it be validated to be such?
-> Or is it alphanumeric, to be treated as a string?
-> Should the function be set up to cope with floating point decimal
   values, too?
-> How about values in Base 16?
-> What if someone wants to have a field that contains a comma?  
-> What should be done if someone encloses the string in quote marks,
   as with the following line?
   "100","cde","456"
-> Should the quotes be stripped?  
-> Is the "100" [in quotes] a string, or a number?
-> What should happen if, on line 4, an invalid value is found?
-> Should the whole file get read into a 2-dimensional array of strings,
   and then processed from there to generate the desired structures?
-> Or should there be record-based processing?
-> Or even field-based processing?

There are enough distinct ways of answering these questions that
there's _not_ a decent way of giving a "standard" answer to the
question.
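
To make one of those choices concrete, here is a sketch of a splitter
that picks a particular set of answers: double quotes are stripped, a
comma inside quotes does not end a field, and every field comes back
as a string with no validation. SPLIT-CSV-LINE is a hypothetical name,
and a different application could reasonably answer each question
differently.

```lisp
(defun split-csv-line (text)
  "Split TEXT on commas, treating double-quoted sections as literal text.
Quote characters are dropped; all fields are returned as strings."
  (let ((fields '())
        (field (make-string-output-stream))
        (in-quotes nil))
    (loop for ch across text
          do (cond ((char= ch #\")
                    ;; Toggle quoting; the quote character itself is dropped.
                    (setf in-quotes (not in-quotes)))
                   ((and (char= ch #\,) (not in-quotes))
                    ;; GET-OUTPUT-STREAM-STRING also resets the stream.
                    (push (get-output-stream-string field) fields))
                   (t (write-char ch field))))
    ;; The last field has no trailing comma.
    (push (get-output-stream-string field) fields)
    (nreverse fields)))
```

With this, "100","c,de",456 yields the three strings 100, c,de and
456. Whether the first one is a number is left to the caller.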

[Not too long ago, someone was asking much the same question about
handling "CDF" files in Guile Scheme for use with GnuCash; the above
set of issues are amongst the issues that popped up in the resultant
discussion...]

> See these operations in CLHS:

>  WITH-OPEN-FILE - Executes a body of code with a stream open to a given file.
>  READ-LINE - reads a line from a stream
>  POSITION - searches for a character (e.g., comma) in a string (or substring)
>             and tells you its position
>  SUBSEQ - allows you to extract a subsequence (e.g., a substring) from a 
>           sequence (e.g., a string)
>  PARSE-INTEGER - allows you to convert a string to an integer

>> Any help would be very gratefully appreciated.

> Good luck.

> Btw, if this is homework, you should have identified it as such.
> People here don't like accidentally doing someone's homework.

This doesn't have the feeling of being HW, but I've been wrong
before...

> Also, if you don't know what CLHS is, it's the Common Lisp
> HyperSpec, the language definition.  See
> http://www.xanalys.com/software_tools/reference/HyperSpec/FrontMatter/
> The functions mentioned above can be found through the "Symbol
> Index" in CLHS.

Good comments, and those functions are indeed amongst the most useful
for the purpose.
-- 
(concatenate 'string "cbbrowne" ·@acm.org")
http://vip.hyperusa.com/~cbbrowne/lisp.html
The average woman would rather have beauty than brains because the
average man can see better than he can think.

From: Steve Long
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 04:37:02 +0000
Message-ID: <3B60EFEE.2BD0AC97@isomedia.com>

········@hex.net wrote:

>
> What's _arguably_ irritating is that if it were so "standard," it
> would be sort of nice to have some de-facto "standard" function around
> to do this sort of thing, rather than always resorting to "It's easy
> to write."

But then, I think that a language should just supply fundamental tools so
that we don't have to carry around functionality that we may never use.

In any case, this is a pretty simple problem.

slong

From: Christophe Rhodes
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 08:08:48 +0000
Message-ID: <sqd76mn69b.fsf@lambda.jesus.cam.ac.uk>

········@hex.net writes:

> Kent M Pitman <······@world.std.com> writes:
> > Neil Conway <····@klamath.dyndns.org> writes:
> > 
> >> Hi everyone,
> >> 
> >> I'm still very new to LISP. I've created this structure:
> >> 
> >> (defstruct line
> >>      (date 0)
> >>      (name "")
> >>      (hits 0))
> >> 
> >> Now, I'm reading data from a file: the data is comma-delimited and each 
> >> line should be used to create a separate structure. So a typical line 
> >> would be like so:
> >> 
> >> 100,abc,200
> >> 
> >> How is the "proper" way to process this line and use that data to create 
> >>   the structure shown above?
> 
> > The community has been discussing creating a standard utility for
> > partitioning a string.  It's easy to write, and everyone does.
> 
> What's _arguably_ irritating is that if it were so "standard," it
> would be sort of nice to have some de-facto "standard" function around
> to do this sort of thing, rather than always resorting to "It's easy
> to write."
> 
> I'd take the contrary position that there _isn't_ a "standard" way of
> doing this, because what gets done to the resulting fields _isn't_
> uniform.

Just to be clear, I take you to mean that there is no "standard" way
of parsing a line of characters into a structure rather than there is
no "standard" way of partitioning a string. Certainly even after
partitioning there needs to be post-processing based on prior
knowledge of the fields, but this is application dependent and as Kent
points out there are functions in CL for quite a few operations.
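
One way to picture that separation is the sketch below: a generic
partitioning step that knows nothing about the data, followed by an
application-specific step that converts each field. SPLIT-ON-COMMA and
CONVERT-FIELDS are names invented for this sketch, and the converter
list for the 100,abc,200 example is an assumption about what the
fields mean.

```lisp
(defun split-on-comma (text)
  "Partition TEXT at each comma and return the pieces as a list of strings."
  (loop with start = 0
        for comma = (position #\, text :start start)
        collect (subseq text start comma)
        while comma
        do (setf start (1+ comma))))

(defun convert-fields (fields converters)
  "Apply each function in CONVERTERS to the corresponding string in FIELDS."
  (mapcar #'funcall converters fields))

;; Application-specific knowledge lives only in the converter list.
(convert-fields (split-on-comma "100,abc,200")
                (list #'parse-integer #'identity #'parse-integer))
```

Only the list of converters changes from one file format to another;
the partitioning function stays the same.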

> [snip long list of issues]

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 510 299
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Neil Conway
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 01:59:25 +0000
Message-ID: <3B60CB04.5070405@klamath.dyndns.org>

Kent M Pitman wrote:
> Good luck.

Thanks again for your help. My first significant Lisp program is nearly 
finished ;-)

> Btw, if this is homework, you should have identified it as such.
> People here don't like accidentally doing someone's homework.

It's not -- I'm learning Lisp because it's a fun language (plus, I'm 
taking CS at University once I finish high school and knowing Lisp will 
be a plus for AI).

Cheers,

Neil

From: Friedrich Dominicus
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 05:30:11 +0000
Message-ID: <87r8v37xcs.fsf@frown.here>

Neil Conway <····@klamath.dyndns.org> writes:
> 
> > Btw, if this is homework, you should have identified it as such.
> > People here don't like accidentally doing someone's homework.
> 
> It's not -- I'm learning Lisp because it's a fun language (plus, I'm
> taking CS at University once I finish high school and knowing Lisp
> will be a plus for AI).

May I ask whether it has to be a comma-separated file? If you can make
it readable by READ, the problems go away,
e.g.
saving
(defun save-line-struct (line-struct)
              (with-open-file (stream "line-struct.lisp"
                                      :direction :output
                                      :if-does-not-exist :create
                                      :if-exists :supersede)
                (print line-struct stream)
                (fresh-line stream)))

reading
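(defun read-line-struct ()
              ;; Open the file for input (the default direction) and
              ;; read back the single object that was printed to it.
              (with-open-file (stream "line-struct.lisp")
                (read stream)))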

Usage:

CL-USER 36 : >(let (line) 
      (setf line (make-line :name "foo" :date 120301))
      (print (line-name line))
      (save-line-struct line)
      (setf line nil)
      (print line)
      (setf line (read-line-struct))
      (print (line-name line))
    line)


output:
"foo" 

NIL 
"foo" 
#S(LINE DATE 120301 NAME "foo" HITS 0)

Till then
Friedrich

From: Neil Conway
Subject: Re: newbie: text parsing Q
Date: Fri, 27 Jul 2001 23:28:14 +0000
Message-ID: <3B61F90E.7060506@klamath.dyndns.org>

Fernando wrote:
> On 27 Jul 2001 07:30:11 +0200, Friedrich Dominicus
> <·····@q-software-solutions.com> wrote:
>>may I ask if it has to be a comma separated file? If you can make it
>>readable by read the problems have gone
>>
> Isn't this dangerous (the equivalent of buffer overflow problems with C)? O:-)

I'm sure it probably is; however, since this is an internal file format 
(generated based on some usage statistics), it would probably be fairly 
easy to write a simple Lisp program to translate the CSV input data into 
a format that can be fed to read, and then a second program to read this 
simplified data format.

Of course, that's completely different than, say, accepting arbitrary 
data from the network and feeding it into read ;-)

Thanks for the idea.

Cheers,

Neil

From: Friedrich Dominicus
Subject: Re: newbie: text parsing Q
Date: Mon, 30 Jul 2001 06:16:45 +0000
Message-ID: <87ae1naqlu.fsf@frown.here>

Neil Conway <····@klamath.dyndns.org> writes:

> Fernando wrote:
> > On 27 Jul 2001 07:30:11 +0200, Friedrich Dominicus
> > <·····@q-software-solutions.com> wrote:
> >>may I ask if it has to be a comma separated file? If you can make it
> >>readable by read the problems have gone
> >>
> > Isn't this dangerous (the equivalent of buffer overflow problems with C)? O:-)
> 
> I'm sure it probably is; however, since this is an internal file
> format (generated based on some usage statistics),

Now, what do you think READ is used for? All Lisp programs go through
it at some point, so I guess there may be ways to fool it. Anyway, I
have not seen a buffer overflow in a Lisp program so far. So for me
the choice is clear: if I can make it readable by READ, I'm done.

Regards
Friedrich

From: Kent M Pitman
Subject: Re: newbie: text parsing Q
Date: Mon, 30 Jul 2001 09:57:01 +0000
Message-ID: <sfw8zh668pe.fsf@world.std.com>

Friedrich Dominicus <·····@q-software-solutions.com> writes:

> 
> Neil Conway <····@klamath.dyndns.org> writes:
> 
> > Fernando wrote:
> > > On 27 Jul 2001 07:30:11 +0200, Friedrich Dominicus
> > > <·····@q-software-solutions.com> wrote:
> > >>may I ask if it has to be a comma separated file? If you can make it
> > >>readable by read the problems have gone
> > >>
> > > Isn't this dangerous (the equivalent of buffer overflow problems with C)? O:-)
> > 
> > I'm sure it probably is; however, since this is an internal file
> > format (generated based on some usage statistics),
> 
> 
> Now what do you think is read used for? All Lisp Programs run through
> it, sometime. So I guess there may be things to fool it. Anyway I do
> not have seen any buffer overflow in a Lisp Program till now. So for
> me the choice is clear. If I can make it readable by read, I'm done.

Well, I took the remark to refer to a more general kind of problem, such
as trojan horses.  But use of WITH-STANDARD-IO-SYNTAX and binding 
*READ-EVAL* to NIL will largely fix that.

The real question is whether the person who wrote the data knew he was
using Lisp syntax.  e.g., if it's going to contain "3." and think it
means floating point because some other languages do, then READ is not
good.  Or if it thinks that # is a constituent character, so #3 might
be generated as a "symbol", that's not good.  Or if it thinks a string
with a newline can be written like "foo\nbar", that's not good. (The
last two are repairable with a suitable readtable; the "3." problem is
not portably repairable.)  In other words, you have to know the space
of data that you're going to read and make sure it's going to be
acceptable to READ both in practice and in future revs of the
product. In general, if it was produced by PRINT that's good for READ.
Or if it's output to some spec controlled by the person writing the
data so that it won't drift away from READ-able in a future rev of the
data-writer, that's good.  But if it's just "the foo database dumper
seems to write it in a form that looks like it might be
Lisp-compatible" then that's not good.
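
Two of those mismatches are easy to see at a REPL. The sketch below
uses a small helper, READ-ONE, which is a made-up name for this
illustration; it reads a single object from a string under standard
syntax.

```lisp
(defun read-one (text)
  "READ a single object from the string TEXT using standard reader syntax."
  (with-standard-io-syntax
    (values (read-from-string text))))

;; A trailing dot marks a decimal integer, so this is the integer 3,
;; not a float.
(read-one "3.")

;; Inside a Lisp string, backslash just quotes the next character, so
;; this reads as the seven-character string foonbar, with no newline.
(read-one "\"foo\\nbar\"")
```

A data writer that emits 3.0 instead of 3. would get a float back, but
that only helps if the writer is under your control.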

From: Friedrich Dominicus
Subject: Re: newbie: text parsing Q
Date: Mon, 30 Jul 2001 12:51:53 +0000
Message-ID: <878zh660ly.fsf@frown.here>

Kent M Pitman <······@world.std.com> writes:

> In general, if it was produced by PRINT that's good for READ.

Well, I showed it using PRINT, and therefore I think my idea is
supported.


> Or if it's output to some spec controlled by the person writing the
> data so that it won't drift away from READ-able in a future rev of the
> data-writer, that's good.
Well that is what I asked. 

>But if it's just "the foo database dumper
> seems to write it in a form that looks like it might be
> Lisp-compatible" then that's not good.
I would think that is obvious.

Regards
Friedrich

From: Arseny Slobodjuck
Subject: Re: newbie: text parsing Q
Date: Mon, 30 Jul 2001 10:44:27 +0000
Message-ID: <3b653929.457497@news.vtc.ru>

On 30 Jul 2001 08:16:45 +0200, Friedrich Dominicus
<·····@q-software-solutions.com> wrote:

>> > Isn't this dangerous (the equivalent of buffer overflow problems with C)? O:-)
> Anyway I do
>not have seen any buffer overflow in a Lisp Program till now. So for
>me the choice is clear. If I can make it readable by read, I'm done.
Look:

[4]> (defun destroy-computer(&optional with-smoke)
        (if with-smoke (format t "destroying computer with smoke~%")
                         (format t "destroying. no smoke~%")))
DESTROY-COMPUTER
[5]> (read)
#.(destroy-computer t)
destroying computer with smoke
NIL
[6]>

From: Kent M Pitman
Subject: Re: newbie: text parsing Q
Date: Mon, 30 Jul 2001 10:15:27 +0000
Message-ID: <sfwae1m205c.fsf@world.std.com>

····@crosswinds.net (Arseny Slobodjuck) writes:

> > Anyway I do
> >not have seen any buffer overflow in a Lisp Program till now. So for
> >me the choice is clear. If I can make it readable by read, I'm done.
> Look:
> 
> [4]> (defun destroy-computer(&optional with-smoke)
>         (if with-smoke (format t "destroying computer with smoke~%")
>                          (format t "destroying. no smoke~%")))
> DESTROY-COMPUTER
> [5]> (read)
> #.(destroy-computer t)
> destroying computer with smoke
> NIL
> [6]>

(with-standard-io-syntax
  (let ((*read-eval* nil)) 
    (read)))
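
For reading from a string rather than from the terminal, the same idea
could be wrapped up roughly as follows. SAFE-READ-FROM-STRING and the
:REJECTED return value are inventions for this sketch; with
*READ-EVAL* bound to NIL, the reader signals a READER-ERROR when it
meets #. instead of evaluating the form.

```lisp
(defun safe-read-from-string (text)
  "READ one object from TEXT with standard syntax and #. disabled.
Returns :REJECTED if the reader signals a READER-ERROR."
  (handler-case
      (with-standard-io-syntax
        (let ((*read-eval* nil))
          (values (read-from-string text))))
    (reader-error () :rejected)))

;; Plain data is read as usual.
(safe-read-from-string "(100 \"abc\" 200)")

;; The #. form is refused rather than evaluated.
(safe-read-from-string "#.(destroy-computer t)")
```

This only closes the read-time evaluation hole; it does not check that
the object read has the shape the program expects.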