reading formatted text

From: ············@gmail.com
Subject: reading formatted text
Date: Tue, 11 Mar 2008 13:50:52 +0000
Message-ID: <8921e8c5-1642-4958-88be-fe0917d33e61@n58g2000hsf.googlegroups.com>

Hi,

Are there Lisp idioms to read formatted text?
I know about the read-line function, but what else (apart from
painfully reading one char at a time, searching manually for
word/number boundaries)?

-Nicolas


From: William James
Subject: Re: reading formatted text
Date: Tue, 11 Mar 2008 14:23:31 +0000
Message-ID: <0f2ae19b-d993-44fe-b48d-7a6f4a66b557@b1g2000hsg.googlegroups.com>

On Mar 11, 8:50 am, ············@gmail.com wrote:
> Hi,
>
> Are there Lisp idioms to read formatted text ?
> I know about the read-line function, but what else (apart from
> painfully reading one char at a time, searching manually for
> word/number boundaries)?
>
> -Nicolas

Give an example of the text you want to parse.

From: ············@gmail.com
Subject: Re: reading formatted text
Date: Tue, 11 Mar 2008 15:08:16 +0000
Message-ID: <502d3e37-0927-440e-bc2c-97ef0703cafa@p25g2000hsf.googlegroups.com>

On Mar 11, 3:23 pm, William James <·········@yahoo.com> wrote:
> On Mar 11, 8:50 am, ············@gmail.com wrote:
>
> > Hi,
>
> > Are there Lisp idioms to read formatted text ?
> > I know about the read-line function, but what else (apart from
> > painfully reading one char at a time, searching manually for
> > word/number boundaries)?
>
> > -Nicolas
>
> Give an example of the text you want to parse.

I'm not thinking of one particular format, but consider the usual
examples:
- CSV-like text, i.e. line-oriented records with a (TAB|SPACE|
whatever) separator (with optional blank space in between)
- an application-specific format, e.g. '<number>=word <*DIGIT> ...' as
in log files (httpd, auth.log, etc.)
- IPv4/IPv6 textual representation
- etc.

From: Matthias Benkard
Subject: Re: reading formatted text
Date: Tue, 11 Mar 2008 15:14:40 +0000
Message-ID: <b60193d6-9737-4489-8897-e6b4bf5b2159@h25g2000hsf.googlegroups.com>

> I don't think about a special format, but one could consider usual
> examples:
> - a csv formatted text, ie line oriented records with (TAB|SPACE|
> whatever) separator (with optional blank space in between)
> - an application specific format, eg '<number>=word <*DIGIT> ...' like
> in log files (httpd, auth.log, etc.)
> - IPv4/IPv6 textual representation
> - etc.

http://www.weitz.de/cl-ppcre/
http://www.cliki.net/parser

~ Matthias

From: ············@gmail.com
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 08:28:50 +0000
Message-ID: <cbf06e25-d449-4f2c-b9c3-5b9719e63852@b1g2000hsg.googlegroups.com>

On Mar 11, 4:14 pm, Matthias Benkard <··········@gmail.com> wrote:
> > I don't think about a special format, but one could consider usual
> > examples:
> > - a csv formatted text, ie line oriented records with (TAB|SPACE|
> > whatever) separator (with optional blank space in between)
> > - an application specific format, eg '<number>=word <*DIGIT> ...' like
> > in log files (httpd, auth.log, etc.)
> > - IPv4/IPv6 textual representation
> > - etc.
>
> http://www.weitz.de/cl-ppcre/
> http://www.cliki.net/parser
>
> ~ Matthias

Well, I know there are some libraries, but that was not my question.
I'm just wondering if there are Lisp idioms for reading formatted text
or not. Are you all using cl-ppcre when reading text ?

Libraries listed in http://www.cliki.net/parser are all optimised for
parsing a given format and so use very specific code (as with puri
that makes use of specialized state machines instead of regular
expressions).

From: Pascal J. Bourguignon
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 09:42:48 +0000
Message-ID: <7cwso8qnvr.fsf@pbourguignon.anevia.com>

············@gmail.com writes:

> On Mar 11, 4:14 pm, Matthias Benkard <··········@gmail.com> wrote:
>> > I don't think about a special format, but one could consider usual
>> > examples:
>> > - a csv formatted text, ie line oriented records with (TAB|SPACE|
>> > whatever) separator (with optional blank space in between)
>> > - an application specific format, eg '<number>=word <*DIGIT> ...' like
>> > in log files (httpd, auth.log, etc.)
>> > - IPv4/IPv6 textual representation
>> > - etc.
>>
>> http://www.weitz.de/cl-ppcre/
>> http://www.cliki.net/parser
>>
>> ~ Matthias
>
> Well, I know there are some libraries, but that was not my question.
> I'm just wondering if there are Lisp idioms for reading formatted text
> or not. Are you all using cl-ppcre when reading text ?

Well, there is no specific shortcut syntax in common-lisp (the package)
like what you could find in perl or awk, so nothing idiomatic.  Apart,
perhaps, from the fact that we've got the lisp reader and the lisp
printer, which can print readably, so we could say that the idiomatic
way to format, print and read in lisp is that.
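
As a rough illustration of the reader-based approach described above, here is a minimal sketch that lets READ tokenize a line of whitespace-separated numbers. READ-NUMBERS is a hypothetical helper name, not something from this thread, and the sketch assumes the input is trusted: the reader accepts any Lisp syntax and will intern symbols it meets.

```lisp
;;; Sketch only: READ-NUMBERS is a hypothetical helper name.
;;; Assumes trusted input that uses Lisp's own number syntax.
(defun read-numbers (line)
  "Return the list of objects that READ finds in the string LINE."
  (let ((*read-eval* nil))                ; refuse #. forms in the input
    (with-input-from-string (in line)
      ;; The stream object itself serves as the end-of-file marker,
      ;; since READ can never return it as a parsed object.
      (loop for object = (read in nil in)
            until (eq object in)
            collect object))))
```

For example, (read-numbers "12 3.5 -7") should give a list of three numbers. For untrusted or non-Lisp-shaped input, PARSE-INTEGER on substrings is probably the safer choice.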

> Libraries listed in http://www.cliki.net/parser are all optimised for
> parsing a given format and so use very specific code (as with puri
> that makes use of specialized state machines instead of regular
> expressions).

Look again, there are parser _generators_; they're parametrized by
the grammar of your formatted input.

But the point is that most often, formatted data has a grammar that is
so simple that a context-free parser is overkill, and a mere regexp,
or even just reading the stuff sequentially, is enough.

-- 
__Pascal Bourguignon__

From: William James
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 09:51:53 +0000
Message-ID: <dc3b207d-5a41-4c6a-991a-12bf4fb2bd92@m3g2000hsc.googlegroups.com>

On Mar 12, 3:42 am, ····@informatimago.com (Pascal J. Bourguignon)
wrote:

> But the point is that most often, formatted data has a grammar that is
> so simple that a context free parser is overkill, and a mere regexp,
> or even just reading the stuff sequentially is enough.
>
> --
> __Pascal Bourguignon__

Three fields:
He said "Stop, thief!" and collapsed.
88
x,y

As a CSV record:
"He said ""Stop, thief!"" and collapsed.",88,"x,y"

Is CSV like this easy to parse?

From: ············@gmail.com
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 11:33:40 +0000
Message-ID: <3c02cb7a-48d9-45f6-a9ca-accc3b0e97e1@m3g2000hsc.googlegroups.com>

On Mar 12, 10:51 am, William James <·········@yahoo.com> wrote:
> On Mar 12, 3:42 am, ····@informatimago.com (Pascal J. Bourguignon)
> wrote:
>
> > But the point is that most often, formatted data has a grammar that is
> > so simple that a context free parser is overkill, and a mere regexp,
> > or even just reading the stuff sequentially is enough.
>
> > --
> > __Pascal Bourguignon__
>
> Three fields:
> He said "Stop, thief!" and collapsed.
> 88
> x,y
>
> As a CSV record:
> "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> Is CSV like this easy to parse?

No. But when exporting to CSV you should take care over what you export
and pick the delimiter that best suits your needs, shouldn't you?

From: William James
Subject: Re: reading formatted text
Date: Thu, 13 Mar 2008 17:48:14 +0000
Message-ID: <68eb624f-14a5-4199-b2da-5573ea6c89fb@n58g2000hsf.googlegroups.com>

On Mar 12, 6:33 am, ············@gmail.com wrote:
> On Mar 12, 10:51 am, William James <·········@yahoo.com> wrote:
>
> > On Mar 12, 3:42 am, ····@informatimago.com (Pascal J. Bourguignon)
> > wrote:
>
> > > But the point is that most often, formatted data has a grammar that is
> > > so simple that a context free parser is overkill, and a mere regexp,
> > > or even just reading the stuff sequentially is enough.
>
> > > --
> > > __Pascal Bourguignon__
>
> > Three fields:
> > He said "Stop, thief!" and collapsed.
> > 88
> > x,y
>
> > As a CSV record:
> > "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> > Is CSV like this easy to parse?
>
> No. But when exporting to CSV you should take care over what you export
> and pick the delimiter that best suits your needs, shouldn't you?

You have to export the data you have.
The comma delimiter is fine; even binary data can be included in the
fields.

From: Rob Warnock
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 12:15:46 +0000
Message-ID: <-Kednfu4GP3vVEranZ2dnUVZ_hynnZ2d@speakeasy.net>

William James  <·········@yahoo.com> wrote:
+---------------
| ····@informatimago.com (Pascal J. Bourguignon) wrote:
| > But the point is that most often, formatted data has a grammar that is
| > so simple that a context free parser is overkill, and a mere regexp,
| > or even just reading the stuff sequentially is enough.
| 
| Three fields:
| He said "Stop, thief!" and collapsed.
| 88
| x,y
| 
| As a CSV record:
| "He said ""Stop, thief!"" and collapsed.",88,"x,y"
| 
| Is CSV like this easy to parse?
+---------------

Pretty much so. Here's what I use [may need tweaking for some applications]:

    ;;; PARSE-CSV-LINE -- Parse one CSV line into a list of fields,
    ;;; stripping quotes and field-internal escape characters.
    ;;; Lexical states: '(normal quoted escaped quoted+escaped)
    ;;;
    (defun parse-csv-line (line)
      (when (or (string= line "")           ; special-case blank lines
                (char= #\# (char line 0)))  ; or those starting with "#"
        (return-from parse-csv-line '()))
      (loop for c across line
            with state = 'normal
            and results = '()
            and chars = '() do
        (ecase state
          ((normal)
           (case c
             ((#\") (setq state 'quoted))
             ((#\\) (setq state 'escaped))
             ((#\,)
              (push (coerce (nreverse chars) 'string) results)
              (setq chars '()))
             (t (push c chars))))
          ((quoted)
           (case c
             ((#\") (setq state 'normal))
             ((#\\) (setq state 'quoted+escaped))
             (t (push c chars))))
          ((escaped) (push c chars) (setq state 'normal))
          ((quoted+escaped) (push c chars) (setq state 'quoted)))
        finally
         (progn
           (push (coerce (nreverse chars) 'string) results) ; close open field
           (return (nreverse results)))))

It handles your sample input:

    > (parse-csv-line (read-line))
    "He said ""Stop, thief!"" and collapsed.",88,"x,y"

    ("He said Stop, thief! and collapsed." "88" "x,y")
    > 


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: William James
Subject: Re: reading formatted text
Date: Thu, 13 Mar 2008 18:11:43 +0000
Message-ID: <ef37272c-b750-4c73-9d51-53eebdb66c67@u69g2000hse.googlegroups.com>

On Mar 12, 7:15 am, ····@rpw3.org (Rob Warnock) wrote:
> William James  <·········@yahoo.com> wrote:
> +---------------
> | ····@informatimago.com (Pascal J. Bourguignon) wrote:
> | > But the point is that most often, formatted data has a grammar that is
> | > so simple that a context free parser is overkill, and a mere regexp,
> | > or even just reading the stuff sequentially is enough.
> |
> | Three fields:
> | He said "Stop, thief!" and collapsed.
> | 88
> | x,y
> |
> | As a CSV record:
> | "He said ""Stop, thief!"" and collapsed.",88,"x,y"
> |
> | Is CSV like this easy to parse?
> +---------------
>
> Pretty much so. Here's what I use [may need tweaking for some applications]:
>
>     ;;; PARSE-CSV-LINE -- Parse one CSV line into a list of fields,
>     ;;; stripping quotes and field-internal escape characters.
>     ;;; Lexical states: '(normal quoted escaped quoted+escaped)
>     ;;;
>     (defun parse-csv-line (line)
>       (when (or (string= line "")           ; special-case blank lines
>                 (char= #\# (char line 0)))  ; or those starting with "#"
>         (return-from parse-csv-line '()))
>       (loop for c across line
>             with state = 'normal
>             and results = '()
>             and chars = '() do
>         (ecase state
>           ((normal)
>            (case c
>              ((#\") (setq state 'quoted))
>              ((#\\) (setq state 'escaped))
>              ((#\,)
>               (push (coerce (nreverse chars) 'string) results)
>               (setq chars '()))
>              (t (push c chars))))
>           ((quoted)
>            (case c
>              ((#\") (setq state 'normal))
>              ((#\\) (setq state 'quoted+escaped))
>              (t (push c chars))))
>           ((escaped) (push c chars) (setq state 'normal))
>           ((quoted+escaped) (push c chars) (setq state 'quoted)))
>         finally
>          (progn
>            (push (coerce (nreverse chars) 'string) results) ; close open field
>            (return (nreverse results)))))
>
> It handles your sample input:
>
>     > (parse-csv-line (read-line))
>     "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
>     ("He said Stop, thief! and collapsed." "88" "x,y")

I think that it should be
  ("He said \"Stop, thief!\" and collapsed." "88" "x,y")

A single regular expression with "captures" can parse csv records.
This is for records that actually have a comma as the field-separator.
To make parsing easier, append a comma to the string containing the
record; that way, every field ends with a comma.

\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),

This will match one field and its terminator.
Note that one opening parenthesis is followed by ?:.
This merely prevents capturing; it is not
actually part of the search pattern.

If the field is enclosed in quotes, the second capture will be
nil; if not in quotes, the first capture will be nil.
As the final step, the first capture (if not nil) must have every
doubled quote ("") replaced with one (").

This regex should work with PCRE.

From: Rob Warnock
Subject: Re: reading formatted text
Date: Fri, 14 Mar 2008 03:39:58 +0000
Message-ID: <QP-dnf8ULa4TbkTanZ2dnUVZ_j2dnZ2d@speakeasy.net>

William James  <·········@yahoo.com> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) wrote:
| > It handles your sample input:
| >     > (parse-csv-line (read-line))
| >     "He said ""Stop, thief!"" and collapsed.",88,"x,y"
| >
| >     ("He said Stop, thief! and collapsed." "88" "x,y")
| 
| I think that it should be
|   ("He said \"Stop, thief!\" and collapsed." "88" "x,y")
+---------------

Oops! Sorry, you're right. The specification I coded that to
didn't support doubling double-quotes to quote them. For the
data it was dealing with, it was considered a "feature" that
an input string like this:

    some "text, here," with "commas, yes?",more text here

would parse as this:

    ("some text, here, with commas, yes?" "more text here")

But as I said, "[it] may need tweaking for some applications"...  ;-}
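
One possible tweak, as a sketch: a hand-written scanner in which a doubled quote inside a quoted field stands for a single quote character. PARSE-CSV-LINE-DOUBLED is a hypothetical name, and the doubling rule is an assumption about the CSV dialect in use. Unlike the code posted earlier, this sketch does not treat backslash as an escape and does not special-case blank or "#" lines.

```lisp
;;; Sketch only: PARSE-CSV-LINE-DOUBLED is a hypothetical name, and the
;;; rule that "" inside a quoted field means one " is an assumption
;;; about the CSV dialect being parsed.
(defun parse-csv-line-doubled (line)
  "Split LINE on commas; inside a quoted field, \"\" stands for one quote."
  (let ((fields '())
        (chars '())
        (quoted nil)
        (i 0)
        (n (length line)))
    (flet ((finish-field ()
             ;; Turn the accumulated characters into a string field.
             (push (coerce (reverse chars) 'string) fields)
             (setf chars '())))
      (loop while (< i n)
            do (let ((c (char line i)))
                 (cond ((and quoted (char= c #\"))
                        (if (and (< (1+ i) n)
                                 (char= (char line (1+ i)) #\"))
                            (progn (push #\" chars) ; doubled quote: keep one
                                   (incf i))        ; and skip the second
                            (setf quoted nil)))     ; closing quote
                       (quoted (push c chars))      ; commas are literal here
                       ((char= c #\") (setf quoted t)) ; opening quote
                       ((char= c #\,) (finish-field))
                       (t (push c chars))))
               (incf i))
      (finish-field)                                ; close the last field
      (nreverse fields))))
```

Fed the sample record from earlier in the thread, this should keep the inner quotes around Stop, thief! while still splitting on the two unquoted commas.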


-Rob


From: Pascal J. Bourguignon
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 13:35:57 +0000
Message-ID: <7cpru0qd36.fsf@pbourguignon.anevia.com>

William James <·········@yahoo.com> writes:

> On Mar 12, 3:42 am, ····@informatimago.com (Pascal J. Bourguignon)
> wrote:
>
>> But the point is that most often, formatted data has a grammar that is
>> so simple that a context free parser is overkill, and a mere regexp,
>> or even just reading the stuff sequentially is enough.
>>
>> --
>> __Pascal Bourguignon__
>
> Three fields:
> He said "Stop, thief!" and collapsed.
> 88
> x,y
>
> As a CSV record:
> "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> Is CSV like this easy to parse?

Yes.  The difficulty with CSV is that there are various variants.  But
once you choose one, the grammar is not recursive, therefore you can
use just regular expressions to scan tokens, or even, for such simple
regular expressions, you may not bother and just write the state
machine yourself.

It's easy to parse, because you don't need a parser, a scanner is enough.

-- 
__Pascal Bourguignon__

From: Øyvin Halfdan Thuv
Subject: Re: reading formatted text
Date: Wed, 12 Mar 2008 12:43:57 +0000
Message-ID: <slrnftfk0d.jnr.oyvinht@decibel.pvv.ntnu.no>

On 2008-03-12, ············@gmail.com <············@gmail.com> wrote:

> Libraries listed in http://www.cliki.net/parser are all optimised for
> parsing a given format and so use very specific code (as with puri
> that makes use of specialized state machines instead of regular
> expressions).

I just added the "parser" category to cl-earley-parser so it is listed there
as well. cl-earley-parser is not tied (optimized) to a specific format; rather,
it lets you define the format yourself using Backus-Naur form.

-- 
Oyvin

From: Carl Taylor
Subject: Re: reading formatted text
Date: Tue, 11 Mar 2008 14:58:02 +0000
Message-ID: <_5xBj.454$D_3.46@bgtnsc05-news.ops.worldnet.att.net>

············@gmail.com wrote:
> Hi,
>
> Are there Lisp idioms to read formatted text ?
> I know about the read-line function, but what else (apart from
> painfully reading one char at a time, searching manually for
> word/number boundaries)?

Depends on exactly what you want to do. But you could investigate string
streams, split-sequence, and read-sequence. The function SPLIT-SEQUENCE is
a LispWorks extension, but it's not hard to write your own. READ-SEQUENCE
is great for reading a file in large blocks. And string streams are useful
for dealing with long strings. Use the permuted index feature of the CLHS
to quickly find all related Lisp operators.

Carl Taylor

A couple of examples with formatted text from an input string.

(delete ""
        (split-sequence '(#\Space #\Newline)
                        "The time has come the walrus
                         said, to talk of many things.")
        :test #'string=)

(let ((data "The time has come the walrus said, to talk of many things.")
      (block (make-list 60)))
  (with-input-from-string (sis data)
    (read-sequence block sis)
    (princ block)))
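
On the remark that a splitter is not hard to write yourself, here is a minimal portable sketch. SPLIT-ON is a hypothetical name; it is not the LispWorks SPLIT-SEQUENCE and makes no attempt to copy its interface. It assumes the separators are given as a list of characters and, unlike the first example above, drops empty pieces itself.

```lisp
;;; Sketch only: SPLIT-ON is a hypothetical stand-in written in
;;; portable Common Lisp, not the LispWorks SPLIT-SEQUENCE.
(defun split-on (separators string)
  "Split STRING at any character in the list SEPARATORS, dropping empty pieces."
  (loop with start = 0
        ;; END is the index of the next separator, or NIL at the end.
        for end = (position-if (lambda (c) (member c separators))
                               string :start start)
        for piece = (subseq string start end)
        unless (string= piece "") collect piece
        while end
        do (setf start (1+ end))))
```

With this, (split-on '(#\Space #\Newline) text) should return the words of TEXT as a list of strings, with runs of separators collapsed.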