I'm sorry if this is a FAQ - I've searched using google, and I can't
find the answer.
I am refreshing my Lisp knowledge this spring. I had planned to learn
Ruby but, after reading how much Ruby owes to Lisp, decided to switch.
It's been 15 years since I even looked at Lisp code, but I remember it
as a lot of fun. I'm still having fun.
I am using Clisp under cygwin to write some scripts for my work. As
a result of this mixed environment, I am running into some problems
with files that contain both dos and unix newlines.
Could someone suggest a way to distinguish between the two newlines
so that read-line grabs everything from the stream that I want it to?
How would you parse them if there is no way to make read-line work?
Thanks,
-Ben
>I am using Clisp under cygwin to write some
>scripts for my work. As a result of this mixed
>environment, I am running into some problems
>with files that contain both dos and unix newlines.
I am also using clisp in windows, but I don't think cygwin is involved.
I also have had no difficulty using read-line on files generated with
notepad. If cygwin is giving you trouble, maybe you should switch. I
got my clisp from here
http://common-lisp.net/project/lispbox/
as part of Lisp in a Box, which also includes an emacs editor nicely
set up to support an REPL and Lisp file editing.
"Ben" <·····@yahoo.com> writes:
>
> I'm sorry if this is a FAQ - I've searched using google, and I can't
> find the answer.
>
> I am using Clisp under cygwin to write some scripts for my work. As
> a result of this mixed environment, I am running into some problems
> with files that contain both dos and unix newlines.
>
> Could someone suggest a way to distinguish between the two newlines
> so that read-line grabs everything from the stream that I want it to?
> How would you parse them if there is no way to make read-line work?
Well, having struggled with this myself, I find that there isn't really
any great answer. A general solution needs a special stream type that
maintains some state information. Things get easier if one only has to
work with files and not interactive streams such as terminal or network
input. It can also be simplified a bit if one can assume that a
particular stream will only have a single type of line ending in it.
I don't actually have any simple Lisp code for handling the issue, but
I can outline an algorithm that will. It is best thought of as a finite
state automaton (FSA) that operates on a stream and keeps a bit of
extra state alongside it.
Start
  --Return-->   return collected string, reset string   ==> Return
  --Linefeed--> return collected string, reset string   ==> Start
  --EOF-->      if string has characters return them, reset string
                else signal EOF                          ==> EOF
  --(Other)-->  collect character into string            ==> Start
Return
  --Linefeed--> do nothing                               ==> Start
  --Return-->   return empty string                      ==> Return
  --EOF-->      signal EOF                               ==> EOF
  --(Other)-->  collect character into string            ==> Start
EOF
  (a read will always return EOF)
One of the key considerations is to always return a line on the FIRST
potential line-terminating character. This will stop you from hanging
if you get a Mac-format file with only CRs on an interactive stream or
via a network interface. That is why some state is needed: you have to
remember the last character you saw so that, when a linefeed arrives,
you know whether it is the second half of a CR-LF pair or a line
terminator of its own.
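A minimal sketch of the automaton above in Common Lisp, using plain
READ-CHAR (so it is the slow variant). The names LINE-READER and
READ-ANY-LINE are made up for illustration, and the sketch assumes the
implementation hands CR and LF through to READ-CHAR unchanged; an
implementation that already normalizes newlines on input will never
show it a bare CR.

```lisp
;; Hypothetical names; a sketch of the Start/Return/EOF automaton.
(defstruct line-reader
  stream
  (after-return nil))   ; true when the previous line ended on a CR

(defun read-any-line (reader)
  "Return the next line of READER's stream as a string, or NIL at EOF.
CR, LF and CR-LF are all accepted as line terminators."
  (let ((in (line-reader-stream reader))
        (out (make-string-output-stream)))
    (loop
      (let ((ch (read-char in nil :eof))
            (after-return (line-reader-after-return reader)))
        ;; The Return state only matters for the first character read.
        (setf (line-reader-after-return reader) nil)
        (cond ((eq ch :eof)
               ;; Return a final unterminated line, otherwise signal EOF.
               (let ((line (get-output-stream-string out)))
                 (return (if (plusp (length line)) line nil))))
              ((char= ch #\Return)
               ;; Return the line at once; do not wait for a possible LF.
               (setf (line-reader-after-return reader) t)
               (return (get-output-stream-string out)))
              ((char= ch #\Linefeed)
               ;; An LF right after a CR is the tail of CR-LF: swallow it.
               (unless after-return
                 (return (get-output-stream-string out))))
              (t
               (write-char ch out)))))))
```

Wrap the input stream once with (make-line-reader :stream in) and call
READ-ANY-LINE until it returns NIL. The state lives in the struct rather
than in the stream, which avoids needing an implementation-specific
stream class.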
The real problem is that unless you implement some sort of buffering
scheme, doing READ-CHAR to get the items into memory will be really
slow. Buffering with READ-SEQUENCE would be faster, but it would
require a lot more machinery to maintain the buffer and handle buffer
fills that span more than one line.
A low-performance, non-interactive version would be pretty simple to
code, but high performance is much trickier to achieve.
A high-performance version would be a good little bit of utility code,
though.
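For plain files (not interactive streams), one way to sidestep the
buffer-refill bookkeeping is to pull the whole file into memory with
READ-SEQUENCE and split it afterwards. The following is only a sketch
under that assumption; SLURP-STREAM and SPLIT-ANY-NEWLINE are
hypothetical names, and it again assumes CR and LF reach Lisp
unconverted.

```lisp
;; Hypothetical helpers; whole-file buffering, so not for huge inputs
;; or interactive streams.
(defun slurp-stream (in)
  "Read every remaining character of IN through a fixed-size buffer."
  (let ((buffer (make-string 4096))
        (out (make-string-output-stream)))
    (loop for n = (read-sequence buffer in)
          do (write-string buffer out :end n)
          while (= n (length buffer)))   ; a short read means EOF
    (get-output-stream-string out)))

(defun split-any-newline (text)
  "Split TEXT on CR, LF or CR-LF. A trailing terminator adds no empty line."
  (let ((lines '())
        (start 0)
        (end (length text)))
    (loop
      (let ((pos (position-if (lambda (c)
                                (or (char= c #\Return) (char= c #\Linefeed)))
                              text :start start)))
        (cond ((null pos)
               ;; Keep a final unterminated line, if there is one.
               (when (< start end)
                 (push (subseq text start) lines))
               (return (nreverse lines)))
              (t
               (push (subseq text start pos) lines)
               ;; Skip two characters for CR-LF, one otherwise.
               (setf start
                     (if (and (char= (char text pos) #\Return)
                              (< (1+ pos) end)
                              (char= (char text (1+ pos)) #\Linefeed))
                         (+ pos 2)
                         (1+ pos)))))))))
```

Typical use would be (split-any-newline (slurp-stream in)) inside a
WITH-OPEN-FILE. Because the whole text is in one string, a CR-LF pair
can never straddle a buffer boundary, which is the case that makes a
true streaming version harder.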
--
Thomas A. Russ, USC/Information Sciences Institute
Thanks for the detailed reply. I was afraid the answer would be that.
For the moment I've slapped in a (nasty) solution.
It's embarrassing, but I'm parsing side-by-side diffs, so if the line is
< 80 chars, I know I hit an illegal newline, and I just read the next
line.
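One possible reading of that workaround, as a rough sketch: keep
reading and joining physical lines until the row reaches the expected
width. READ-DIFF-ROW is a hypothetical name, and both the default
width of 80 and the choice to concatenate the pieces are assumptions,
not a description of the actual script.

```lisp
;; Hypothetical helper; assumes every complete row is at least WIDTH
;; characters long, so a shorter line means a stray newline split it.
(defun read-diff-row (in &optional (width 80))
  "Read one logical row from IN, joining lines shorter than WIDTH.
Returns NIL at end of file."
  (let ((row (read-line in nil nil)))
    (loop while (and row (< (length row) width))
          do (let ((next (read-line in nil nil)))
               (if next
                   (setf row (concatenate 'string row next))
                   (return))))   ; EOF: give back what we have
    row))
```

This breaks as soon as a legitimate row is shorter than WIDTH, which is
why it only suits fixed-width input such as side-by-side diff output.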
I WILL eventually write that stream utility, but I've spent more of
work's time learning lisp than I really should, so I feel I should
produce something useful. I can release the nasty solution this
afternoon.
Thanks again.
-Ben
From: Sam Steingold
Subject: Re: unix / dos newline - possible FAQ?
Date:
Message-ID: <ufywsm9jy.fsf@gnu.org>
> * Ben <·····@lnubb.pbz> [2005-05-12 10:47:10 -0700]:
>
> I'm sorry if this is a FAQ - I've searched using google, and I can't
> find the answer.
<http://www.google.com/search?q=clisp+newline>
the first hit answers your questions:
<http://clisp.cons.org/impnotes/clhs-newline.html>
a little bit more detailed explanation is here:
<http://www.podval.org/~sds/clisp/impnotes/clhs-newline.html>
> Could someone suggest a way to distinguish between the two newlines
> so that read-line grabs everything from the stream that I want it to?
> How would you parse them if there is no way to make read-line work?
READ-LINE treats all newlines the same,
as specifically recommended by the "Unicode Newline Guidelines"
<http://www.unicode.org/reports/tr13/tr13-9.html>.
if you need to distinguish between CR, LF and CR+LF in your input files,
you should use READ-CHAR-SEQUENCE instead of READ-LINE
<http://clisp.cons.org/impnotes/stream-dict.html#bulk-io>
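As a portable cross-check, independent of any character-level
conversion, the terminators in a file can be counted by reading it as
octets. This is only a sketch: COUNT-LINE-TERMINATORS is a hypothetical
name, and it assumes an ASCII-compatible encoding (ISO-8859-1, UTF-8
and the like) where CR is byte 13 and LF is byte 10.

```lisp
;; Hypothetical helper; not valid for UTF-16 or other encodings where
;; bytes 10 and 13 can occur inside a character.
(defun count-line-terminators (path)
  "Return three values: the counts of CR-LF pairs, bare LFs and bare CRs."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((crlf 0) (lf 0) (cr 0)
          (pending-cr nil))   ; a CR was seen and is not yet classified
      (loop for byte = (read-byte in nil nil)
            while byte
            do (cond ((= byte 13)
                      (when pending-cr (incf cr))
                      (setf pending-cr t))
                     ((= byte 10)
                      (if pending-cr (incf crlf) (incf lf))
                      (setf pending-cr nil))
                     (t
                      (when pending-cr (incf cr))
                      (setf pending-cr nil))))
      ;; A CR as the very last byte is a bare CR.
      (when pending-cr (incf cr))
      (values crlf lf cr))))
```

More than one non-zero count means the file mixes conventions, which is
the situation described in the original question.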
--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.openvotingconsortium.org/> <http://www.jihadwatch.org/>
<http://www.honestreporting.com> <http://www.memri.org/> <http://ffii.org/>
Don't use force -- get a bigger hammer.
"Ben" <·····@yahoo.com> writes:
> I'm sorry if this is a FAQ - I've searched using google, and I can't
> find the answer.
>
> I am refreshing my Lisp knowledge this spring. I had planned to learn
> Ruby but, after reading how much Ruby owes to Lisp, decided to switch.
> It's been 15 years since I even looked at Lisp code, but I remember it
> as a lot of fun. I'm still having fun.
>
> I am using Clisp under cygwin to write some scripts for my work. As
> a result of this mixed environment, I am running into some problems
> with files that contain both dos and unix newlines.
>
> Could someone suggest a way to distinguish between the two newlines
> so that read-line grabs everything from the stream that I want it to?
> How would you parse them if there is no way to make read-line work?
You can specify explicitly the newline you want with clisp by giving
:external-format a clisp-specific value. But in any case, when clisp
reads text files, it accepts any newline sequence:
;; EXT:MAKE-ENCODING and the CHARSET package are clisp-specific.
(dolist (onewline '(:unix :dos :mac))
  ;; Write two lines using the chosen line terminator.
  (with-open-file (out "test.txt" :direction :output
                       :if-does-not-exist :create :if-exists :supersede
                       :external-format (ext:make-encoding
                                         :charset charset:iso-8859-1
                                         :line-terminator onewline))
    (format out "line1~%line2~%"))
  ;; Dump the raw bytes to show what actually landed in the file.
  (with-open-file (in "test.txt" :direction :input
                      :element-type '(unsigned-byte 8))
    (loop for byte = (read-byte in nil nil)
          while byte do (format t "~2,'0X " byte)
          finally (format t "~%")))
  ;; Read the first line back under each line-terminator setting.
  (dolist (inewline '(:unix :dos :mac))
    (with-open-file (in "test.txt" :direction :input
                        :external-format (ext:make-encoding
                                          :charset charset:iso-8859-1
                                          :line-terminator inewline))
      (let ((line (read-line in)))
        (format t "written as ~6A, read as ~6A: ~3D ~S~%"
                onewline inewline (length line) line)))))
6C 69 6E 65 31 0A 6C 69 6E 65 32 0A
written as UNIX , read as UNIX : 5 "line1"
written as UNIX , read as DOS : 5 "line1"
written as UNIX , read as MAC : 5 "line1"
6C 69 6E 65 31 0D 0A 6C 69 6E 65 32 0D 0A
written as DOS , read as UNIX : 5 "line1"
written as DOS , read as DOS : 5 "line1"
written as DOS , read as MAC : 5 "line1"
6C 69 6E 65 31 0D 6C 69 6E 65 32 0D
written as MAC , read as UNIX : 5 "line1"
written as MAC , read as DOS : 5 "line1"
written as MAC , read as MAC : 5 "line1"
NIL
Now, I'm using clisp-2.33.83, and AFAIK, it worked the same with
clisp-2.33.2, but I cannot say if you're using an older version.
Upgrade!
--
__Pascal Bourguignon__ http://www.informatimago.com/
This is a signature virus. Add me to your signature and help me to live
Pascal Bourguignon <···@informatimago.com> writes:
> But in any case, when clisp
> reads text files, it accepts any newline sequence:
Hooray for clisp!
>
> 6C 69 6E 65 31 0A 6C 69 6E 65 32 0A
> written as UNIX , read as UNIX : 5 "line1"
> written as UNIX , read as DOS : 5 "line1"
> written as UNIX , read as MAC : 5 "line1"
> 6C 69 6E 65 31 0D 0A 6C 69 6E 65 32 0D 0A
> written as DOS , read as UNIX : 5 "line1"
> written as DOS , read as DOS : 5 "line1"
> written as DOS , read as MAC : 5 "line1"
> 6C 69 6E 65 31 0D 6C 69 6E 65 32 0D
> written as MAC , read as UNIX : 5 "line1"
> written as MAC , read as DOS : 5 "line1"
> written as MAC , read as MAC : 5 "line1"
> NIL
But what does it get for the second line? Does it get those correct as well?
--
Thomas A. Russ, USC/Information Sciences Institute
···@sevak.isi.edu (Thomas A. Russ) writes:
> Pascal Bourguignon <···@informatimago.com> writes:
>
>> But in any case, when clisp
>> reads text files, it accepts any newline sequence:
>
> Hooray for clisp!
>
>>
>> 6C 69 6E 65 31 0A 6C 69 6E 65 32 0A
>> written as UNIX , read as UNIX : 5 "line1"
>> written as UNIX , read as DOS : 5 "line1"
>> written as UNIX , read as MAC : 5 "line1"
>> 6C 69 6E 65 31 0D 0A 6C 69 6E 65 32 0D 0A
>> written as DOS , read as UNIX : 5 "line1"
>> written as DOS , read as DOS : 5 "line1"
>> written as DOS , read as MAC : 5 "line1"
>> 6C 69 6E 65 31 0D 6C 69 6E 65 32 0D
>> written as MAC , read as UNIX : 5 "line1"
>> written as MAC , read as DOS : 5 "line1"
>> written as MAC , read as MAC : 5 "line1"
>> NIL
>
> But what does it get for the second line? Does it get those correct as well?
You have the source!
Yes, of course.
;; Same test as before, but reading both lines back.
(dolist (onewline '(:unix :dos :mac))
  ;; Write two lines using the chosen line terminator.
  (with-open-file (out "test.txt" :direction :output
                       :if-does-not-exist :create :if-exists :supersede
                       :external-format (ext:make-encoding
                                         :charset charset:iso-8859-1
                                         :line-terminator onewline))
    (format out "line1~%line2~%"))
  ;; Dump the raw bytes to show what actually landed in the file.
  (with-open-file (in "test.txt" :direction :input
                      :element-type '(unsigned-byte 8))
    (loop for byte = (read-byte in nil nil)
          while byte do (format t "~2,'0X " byte)
          finally (format t "~%")))
  ;; Read both lines back under each line-terminator setting.
  (dolist (inewline '(:unix :dos :mac))
    (with-open-file (in "test.txt" :direction :input
                        :external-format (ext:make-encoding
                                          :charset charset:iso-8859-1
                                          :line-terminator inewline))
      (let ((line1 (read-line in))
            (line2 (read-line in)))
        (format t "written as ~6A, read as ~6A: ~{~% ~3D ~S~}~%"
                onewline inewline
                (list (length line1) line1
                      (length line2) line2))))))
6C 69 6E 65 31 0A 6C 69 6E 65 32 0A
written as UNIX , read as UNIX :
5 "line1"
5 "line2"
written as UNIX , read as DOS :
5 "line1"
5 "line2"
written as UNIX , read as MAC :
5 "line1"
5 "line2"
6C 69 6E 65 31 0D 0A 6C 69 6E 65 32 0D 0A
written as DOS , read as UNIX :
5 "line1"
5 "line2"
written as DOS , read as DOS :
5 "line1"
5 "line2"
written as DOS , read as MAC :
5 "line1"
5 "line2"
6C 69 6E 65 31 0D 6C 69 6E 65 32 0D
written as MAC , read as UNIX :
5 "line1"
5 "line2"
written as MAC , read as DOS :
5 "line1"
5 "line2"
written as MAC , read as MAC :
5 "line1"
5 "line2"
NIL
--
__Pascal Bourguignon__ http://www.informatimago.com/
The mighty hunter
Returns with gifts of plump birds,
Your foot just squashed one.
Pascal Bourguignon <···@informatimago.com> writes:
>
> ···@sevak.isi.edu (Thomas A. Russ) writes:
> >
> > But what does it get for the second line? Does it get those correct as well?
>
> You have the source!
Yes, but not the clisp implementation to run it on....
> Yes, of course.
Cool. I wish this were more universally true, both across Common Lisp
implementations and across other languages as well.
--
Thomas A. Russ, USC/Information Sciences Institute
On 12 May 2005 10:47:10 -0700, <·····@yahoo.com> wrote:
> Could someone suggest a way to distinguish between the two newlines
> so that read-line grabs everything from the stream that I want it to?
> How would you parse them if there is no way to make read-line work?
Emacs.
Emacs will convert text files, can be run from the command line, etc.
--
Everyman has three hearts;
one to show the world, one to show friends, and one only he knows.