Hi,
Are there Lisp idioms to read formatted text?
I know about the read-line function, but what else is there (apart from
painfully reading one char at a time, searching manually for word/
number boundaries)?
-Nicolas
On Mar 11, 8:50 am, ············@gmail.com wrote:
> Hi,
>
> Are there Lisp idioms to read formatted text ?
Give an example of the text you want to parse.
On Mar 11, 3:23 pm, William James <·········@yahoo.com> wrote:
> Give an example of the text you want to parse.
I'm not thinking of a particular format, but one could consider the usual
examples:
- CSV-formatted text, i.e. line-oriented records with a (TAB|SPACE|
whatever) separator (with optional blank space in between)
- an application-specific format, e.g. '<number>=word <*DIGIT> ...' as
in log files (httpd, auth.log, etc.)
- IPv4/IPv6 textual representation
- etc.
> I don't think about a special format, but one could consider usual
> examples:
http://www.weitz.de/cl-ppcre/
http://www.cliki.net/parser
~ Matthias
On Mar 11, 4:14 pm, Matthias Benkard <··········@gmail.com> wrote:
> http://www.weitz.de/cl-ppcre/
> http://www.cliki.net/parser
>
> ~ Matthias
Well, I know there are some libraries, but that was not my question.
I'm just wondering whether there are Lisp idioms for reading formatted
text or not. Are you all using cl-ppcre when reading text?
Libraries listed in http://www.cliki.net/parser are all optimised for
parsing a given format and so use very specific code (as with puri
that makes use of specialized state machines instead of regular
expressions).
············@gmail.com writes:
> Well, I know there are some libraries, but that was not my question.
> I'm just wondering if there are Lisp idioms for reading formatted text
> or not. Are you all using cl-ppcre when reading text ?
Well, there is no specific shortcut syntax in common-lisp (the package)
like what you could find in Perl or awk, so nothing idiomatic. Apart,
perhaps, from the fact that we've got the lisp reader and the lisp
printer, which can print readably, so we could say that the idiomatic
way to format, print and read in lisp is that.
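A minimal sketch of that reader/printer round trip, in case it helps.
SAVE-RECORD, LOAD-RECORD and ROUND-TRIP are made-up names, and the
record layout is only an assumption for illustration:

```lisp
;; Hypothetical helpers: print a record readably, then read it back.
(defun save-record (record stream)
  ;; WITH-STANDARD-IO-SYNTAX binds *PRINT-READABLY* to T (among others),
  ;; so the output is meant to be accepted by READ.
  (with-standard-io-syntax
    (prin1 record stream)
    (terpri stream)))

(defun load-record (stream)
  (with-standard-io-syntax
    ;; Rebind *READ-EVAL* to NIL so that "#." in untrusted input
    ;; cannot run arbitrary code.
    (let ((*read-eval* nil))
      (read stream nil :eof))))

(defun round-trip (record)
  (with-input-from-string (in (with-output-to-string (out)
                                (save-record record out)))
    (load-record in)))
```

This only works when you control both ends of the format; it is not a
way to read somebody else's CSV or log files.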
> Libraries listed in http://www.cliki.net/parser are all optimised for
> parsing a given format and so use very specific code (as with puri
> that makes use of specialized state machines instead of regular
> expressions).
Look again: there are parser _generators_, which are parametrized by
the grammar of your formatted input.
But the point is that most often, formatted data has a grammar that is
so simple that a context-free parser is overkill, and a mere regexp,
or even just reading the stuff sequentially, is enough.
--
__Pascal Bourguignon__
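To illustrate "just reading the stuff sequentially", here is a rough
sketch for the dotted-quad IPv4 case mentioned earlier in the thread.
PARSE-IPV4 is a made-up name, and the strictness rules (exactly four
octets, each 0-255) are my assumption, not a full address parser:

```lisp
;; Hypothetical helper: parse "192.168.0.1" into a list of four octets.
;; Signals an error on malformed input.  Note that PARSE-INTEGER
;; tolerates surrounding whitespace and a leading sign; tighten the
;; checks if that matters for your input.
(defun parse-ipv4 (string)
  (let ((octets '())
        (start 0))
    (loop
      (let* ((dot (position #\. string :start start))
             (octet (parse-integer string :start start :end dot)))
        (unless (<= 0 octet 255)
          (error "Octet out of range in ~S: ~D" string octet))
        (push octet octets)
        (if dot
            (setf start (1+ dot))
            (return))))
    (unless (= (length octets) 4)
      (error "Expected four octets in ~S" string))
    (nreverse octets)))
```

With this, (parse-ipv4 "192.168.0.1") returns (192 168 0 1). Only
POSITION and PARSE-INTEGER from the standard are used; no regexp
library is involved.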
On Mar 12, 3:42 am, ····@informatimago.com (Pascal J. Bourguignon)
wrote:
> But the point is that most often, formatted data has a grammar that is
> so simple that a context-free parser is overkill, and a mere regexp,
> or even just reading the stuff sequentially, is enough.
Three fields:
He said "Stop, thief!" and collapsed.
88
x,y
As a CSV record:
"He said ""Stop, thief!"" and collapsed.",88,"x,y"
Is CSV like this easy to parse?
On Mar 12, 10:51 am, William James <·········@yahoo.com> wrote:
> Three fields:
> He said "Stop, thief!" and collapsed.
> 88
> x,y
>
> As a CSV record:
> "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> Is CSV like this easy to parse?
No. But shouldn't you take care over what you export, and over which
delimiter best suits your needs, when exporting to a CSV format?
On Mar 12, 6:33 am, ············@gmail.com wrote:
> No. But shouldn't you take care over what you export, and over which
> delimiter best suits your needs, when exporting to a CSV format?
You have to export the data you have.
The comma delimiter is fine; even binary data can be included in the
fields.
William James <·········@yahoo.com> wrote:
+---------------
| Three fields:
| He said "Stop, thief!" and collapsed.
| 88
| x,y
|
| As a CSV record:
| "He said ""Stop, thief!"" and collapsed.",88,"x,y"
|
| Is CSV like this easy to parse?
+---------------
Pretty much so. Here's what I use [may need tweaking for some applications]:
;;; PARSE-CSV-LINE -- Parse one CSV line into a list of fields,
;;; stripping quotes and field-internal escape characters.
;;; Lexical states: '(normal quoted escaped quoted+escaped)
;;;
(defun parse-csv-line (line)
  (when (or (string= line "")           ; special-case blank lines
            (char= #\# (char line 0)))  ; or those starting with "#"
    (return-from parse-csv-line '()))
  (loop for c across line
        with state = 'normal
        and results = '()
        and chars = '() do
    (ecase state
      ((normal)
       (case c
         ((#\") (setq state 'quoted))
         ((#\\) (setq state 'escaped))
         ((#\,)
          (push (coerce (nreverse chars) 'string) results)
          (setq chars '()))
         (t (push c chars))))
      ((quoted)
       (case c
         ((#\") (setq state 'normal))
         ((#\\) (setq state 'quoted+escaped))
         (t (push c chars))))
      ((escaped) (push c chars) (setq state 'normal))
      ((quoted+escaped) (push c chars) (setq state 'quoted)))
    finally
      (progn
        (push (coerce (nreverse chars) 'string) results) ; close open field
        (return (nreverse results)))))
It handles your sample input:
> (parse-csv-line (read-line))
"He said ""Stop, thief!"" and collapsed.",88,"x,y"
("He said Stop, thief! and collapsed." "88" "x,y")
>
-Rob
-----
Rob Warnock <····@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
On Mar 12, 7:15 am, ····@rpw3.org (Rob Warnock) wrote:
> It handles your sample input:
>
> > (parse-csv-line (read-line))
> "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> ("He said Stop, thief! and collapsed." "88" "x,y")
I think that it should be
("He said \"Stop, thief!\" and collapsed." "88" "x,y")
A single regular expression with "captures" can parse csv records.
This is for records that actually have a comma as the field-separator.
To make parsing easier, append a comma to the string containing the
record; that way, every field ends with a comma.
\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),
This will match one field and its terminator.
Note that one opening parenthesis is followed by ?:.
This merely prevents capturing and is not
actually part of the search pattern.
If the field is enclosed in quotes, the second capture will be
nil; if not in quotes, the first capture will be nil.
As the final step, the first capture (if not nil) must have every
doubled quote ("") replaced with one (").
This regex should work with PCRE.
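The final step (collapsing doubled quotes in the first capture) needs
no regex at all. A small sketch in plain Common Lisp follows;
COLLAPSE-DOUBLED-QUOTES is a made-up name, and it assumes the field has
already had its enclosing quotes removed by the match:

```lisp
;; Hypothetical helper for the final step: turn each "" into a single ".
(defun collapse-doubled-quotes (field)
  (with-output-to-string (out)
    (let ((i 0)
          (n (length field)))
      (loop while (< i n)
            do (let ((c (char field i)))
                 (write-char c out)
                 ;; After writing a quote, skip the second quote of a
                 ;; doubled pair if there is one.
                 (if (and (char= c #\")
                          (< (1+ i) n)
                          (char= (char field (1+ i)) #\"))
                     (incf i 2)
                     (incf i 1)))))))
```

Applied to the first field of the sample record, this gives back the
text with one pair of quotes around Stop, thief!.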
William James <·········@yahoo.com> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) wrote:
| > It handles your sample input:
| > > (parse-csv-line (read-line))
| > "He said ""Stop, thief!"" and collapsed.",88,"x,y"
| >
| > ("He said Stop, thief! and collapsed." "88" "x,y")
|
| I think that it should be
| ("He said \"Stop, thief!\" and collapsed." "88" "x,y")
+---------------
Oops! Sorry, you're right. The specification I coded that to
didn't support doubling double-quotes to quote them. For the
data it was dealing with, it was considered a "feature" that
an input string like this:
some "text, here," with "commas, yes?",more text here
would parse as this:
("some text, here, with commas, yes?" "more text here")
But as I said, "[it] may need tweaking for some applications"... ;-}
-Rob
-----
Rob Warnock <····@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
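For the dialect where a doubled quote inside a quoted field stands for
one literal quote, the tweak could look roughly like this. It is a
sketch, not the code posted above: PARSE-CSV-LINE* is a made-up name,
the backslash escapes and the "#" comment lines are dropped for brevity,
and an empty line yields ("") rather than ():

```lisp
;; Sketch of a variant that treats "" inside a quoted field as a
;; literal quote.  States: :normal, :quoted, :quote-seen.
(defun parse-csv-line* (line)
  (let ((state :normal)
        (fields '())
        (chars '()))
    (flet ((finish-field ()
             (push (coerce (nreverse chars) 'string) fields)
             (setf chars '())))
      (loop for c across line
            do (ecase state
                 (:normal
                  (case c
                    (#\" (setf state :quoted))
                    (#\, (finish-field))
                    (t (push c chars))))
                 (:quoted
                  (if (char= c #\")
                      (setf state :quote-seen)
                      (push c chars)))
                 ;; A quote inside a quoted field either closes the
                 ;; field or starts a doubled "" pair; the next
                 ;; character decides which.
                 (:quote-seen
                  (case c
                    (#\" (push c chars) (setf state :quoted))
                    (#\, (finish-field) (setf state :normal))
                    (t (push c chars) (setf state :normal))))))
      (finish-field)                    ; close the last open field
      (nreverse fields))))
```

On the sample record this returns
("He said \"Stop, thief!\" and collapsed." "88" "x,y").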
William James <·········@yahoo.com> writes:
> Three fields:
> He said "Stop, thief!" and collapsed.
> 88
> x,y
>
> As a CSV record:
> "He said ""Stop, thief!"" and collapsed.",88,"x,y"
>
> Is CSV like this easy to parse?
Yes. The difficulty with CSV is that there are several variants. But
once you choose one, the grammar is not recursive, therefore you can
use just regular expressions to scan tokens, or even, for such simple
regular expressions, you may not bother and just write the state
machine yourself.
It's easy to parse because you don't need a parser; a scanner is enough.
--
__Pascal Bourguignon__
On 2008-03-12, ············@gmail.com <············@gmail.com> wrote:
> Libraries listed in http://www.cliki.net/parser are all optimised for
> parsing a given format and so use very specific code (as with puri
> that makes use of specialized state machines instead of regular
> expressions).
I just added the "parser" category to cl-earley-parser so it is listed there
as well. cl-earley-parser is not tied (optimized) to a specific format; rather,
it lets you define the format yourself using Backus-Naur form.
--
Oyvin
············@gmail.com wrote:
> Hi,
>
> Are there Lisp idioms to read formatted text ?
> I know about the read-line function, but what else (apart from
> painfully reading one char at a time, searching manually for word/
> number boundaries)?
Depends on exactly what you want to do. But you could investigate string
streams, split-sequence, and read-sequence.
The function SPLIT-SEQUENCE is a LispWorks extension, but it's not hard to
write your own. READ-SEQUENCE is great for reading a file in large blocks.
And string streams are useful for dealing with long strings. Use the
permuted index feature of the CLHS to quickly find all related Lisp operators.
Carl Taylor
A couple of examples with formatted text from an input string.
(delete ""
        (split-sequence '(#\Space #\Newline)
                        "The time has come the walrus
said, to talk of many things.")
        :test #'string=)

(let ((data "The time has come the walrus said, to talk of many things.")
      (block (make-list 60)))
  (with-input-from-string (sis data)
    (read-sequence block sis)
    (princ block)))
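For those without LispWorks, here is one way a home-grown splitter
might look. SPLIT-ON is a made-up name and this is only a sketch using
standard operators; unlike the DELETE call above, it drops the empty
pieces itself:

```lisp
;; SPLIT-ON (hypothetical): split STRING at any character found in the
;; list DELIMITERS, dropping empty pieces.
(defun split-on (delimiters string)
  (loop with start = 0
        ;; END is the index of the next delimiter, or NIL at the end.
        for end = (position-if (lambda (c) (member c delimiters))
                               string :start start)
        for piece = (subseq string start end)
        unless (string= piece "") collect piece
        while end
        do (setf start (1+ end))))
```

For example, (split-on '(#\Space #\Newline) "to talk of many things.")
returns ("to" "talk" "of" "many" "things.").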