From: Bob Felts
Subject: Cross-platform read-line question
Date: 
Message-ID: <1hb9sjq.10ya3421o9tf9cN%wrf3@stablecross.com>
I have to work with files created on DOS, Unix, and Classic Mac
machines.

Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
and Classic Mac \r line endings?

Or am I going to have to roll my own using read-char (not that I haven't
had to do this before in other languages).

From: Mikalai
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <1140822727.900897.222300@j33g2000cwa.googlegroups.com>
Bob Felts wrote:
> I have to work with files created on DOS, Unix, and Classic Mac
> machines.
>
> Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
> and Classic Mac \r line endings?
>
> Or am I going to have to roll my own using read-char (not that I haven't
> had to do this before in other languages).

From http://www.lisp.org/HyperSpec/Body/fun_read-line.html
"
Description:
...
The primary value, line, is the line that is read, represented as a
string (without the trailing newline, if any).
"

There should be no newlines. As far as I understand, it is the Lisp
implementor who has to worry about recognizing the newline so that it
can be stripped properly.

I haven't tried reading, for example, a Windows file with Windows
newlines in a Unix Lisp, which may not care about Windows conventions.
So this is more a question of using files from a different platform. I
do not know the answer, and I would like to hear from someone whether
read-line properly strips "alien" newlines.
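
A minimal sketch of a partial workaround, assuming a Lisp whose READ-LINE
splits only at #\Newline (LF), so that a DOS line comes back with a
trailing #\Return. READ-LINE-SANS-CR is a hypothetical helper name, not a
standard function. It does not help with Classic Mac (CR-only) files,
which such a Lisp would return as one long line.

```lisp
(defun read-line-sans-cr (stream &optional (eof-error-p t) eof-value)
  "Like READ-LINE, but also drop one trailing #\\Return from the line."
  (let ((line (read-line stream eof-error-p eof-value)))
    (if (and (stringp line)
             (plusp (length line))
             (char= (char line (1- (length line))) #\Return))
        ;; A CR was left over from a CR LF pair: strip it.
        (subseq line 0 (1- (length line)))
        ;; Plain LF line, or the EOF-VALUE: return it unchanged.
        line)))
```

Only the first return value of READ-LINE is kept here; the
missing-newline-p flag is dropped for brevity.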
From: Edi Weitz
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <u1wxsjp6v.fsf@agharta.de>
On Fri, 24 Feb 2006 15:50:18 -0500, ····@stablecross.com (Bob Felts) wrote:

> Is there any way to get read-line to recognize the DOS \r\n, Unix
> \n, and Classic Mac \r line endings?

I'm assuming you're trying to read files with different line endings
from one program.  If you're only ever reading the "native" format of
your platform, you shouldn't need to care about this.

Generally, this is implementation-dependent - see the EXTERNAL-FORMAT
keyword argument to OPEN.  Most implementations will provide several
different external formats (or variants of external formats) to deal
with the different line endings you mention.

For an approach that works across many popular implementations see
(shameless plug):

  <http://weitz.de/flexi-streams/>

Cheers,
Edi.

-- 

European Common Lisp Meeting 2006: <http://weitz.de/eclm2006/>

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Pascal Bourguignon
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <87lkw0b5va.fsf@thalassa.informatimago.com>
····@stablecross.com (Bob Felts) writes:

> I have to work with files created on DOS, Unix, and Classic Mac
> machines.
>
> Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
> and Classic Mac \r line endings?
>
> Or am I going to have to roll my own using read-char (not that I haven't
> had to do this before in other languages).

Quite easy:

(with-open-file (file "file-path"
                 :external-format #+clisp (ext:make-encoding 
                                             :charset charset
                                             :line-terminator line-terminator)
                                  #-clisp (error "I don't know how to specify
                                                  the encoding on ~S"
                                                  (lisp-implementation-type)))
   (read-line file))

charset can be one of the following from the CHARSET package:

                  ARMSCII-8 ASCII BASE64 BIG5 BIG5-HKSCS CP1133 CP1250 
                  CP1251 CP1252 CP1253 CP1254 CP1255 CP1256 CP1257 
                  CP1258 CP437 CP437-IBM CP737 CP775 CP850 CP852 
                  CP852-IBM CP855 CP857 CP860 CP860-IBM CP861 CP861-IBM 
                  CP862 CP862-IBM CP863 CP863-IBM CP864 CP864-IBM CP865 
                  CP865-IBM CP866 CP869 CP869-IBM CP874 CP874-IBM CP932 
                  CP936 CP949 CP950 EUC-CN EUC-JP EUC-KR EUC-TW GB18030 
                  GBK GEORGIAN-ACADEMY GEORGIAN-PS HP-ROMAN8 
                  ISO-2022-CN ISO-2022-CN-EXT ISO-2022-JP ISO-2022-JP-2 
                  ISO-2022-KR ISO-8859-1 ISO-8859-10 ISO-8859-13 
                  ISO-8859-14 ISO-8859-15 ISO-8859-16 ISO-8859-2 
                  ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 
                  ISO-8859-7 ISO-8859-8 ISO-8859-9 JAVA JIS_X0201 JOHAB 
                  KOI8-R KOI8-U MAC-ARABIC MAC-CENTRAL-EUROPE 
                  MAC-CROATIAN MAC-CYRILLIC MAC-DINGBAT MAC-GREEK 
                  MAC-HEBREW MAC-ICELAND MAC-ROMAN MAC-ROMANIA 
                  MAC-SYMBOL MAC-THAI MAC-TURKISH MAC-UKRAINE MACINTOSH 
                  NEXTSTEP SHIFT-JIS TCVN TIS-620 UCS-2 UCS-4 
                  UNICODE-16 UNICODE-16-BIG-ENDIAN 
                  UNICODE-16-LITTLE-ENDIAN UNICODE-32 
                  UNICODE-32-BIG-ENDIAN UNICODE-32-LITTLE-ENDIAN UTF-16 
                  UTF-7 UTF-8 VISCII WINDOWS-1250 WINDOWS-1251 
                  WINDOWS-1252 WINDOWS-1253 WINDOWS-1254 WINDOWS-1255 
                  WINDOWS-1256 WINDOWS-1257 WINDOWS-1258

for example CHARSET:MAC-ROMAN may be good for Macintosh files.

line-terminator can be one of:   :UNIX  :MAC  :DOS

All this is of course documented in the implementation notes:
http://clisp.cons.org/impnotes.html

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Small brave carnivores
Kill pine cones and mosquitoes
Fear vacuum cleaner
From: Bob Felts
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <1hbaa58.kvncmq17cv8c0N%wrf3@stablecross.com>
Pascal Bourguignon <······@informatimago.com> wrote:

> ····@stablecross.com (Bob Felts) writes:
> 
> > I have to work with files created on DOS, Unix, and Classic Mac
> > machines.
> >
> > Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
> > and Classic Mac \r line endings?
> >
> > Or am I going to have to roll my own using read-char (not that I haven't
> > had to do this before in other languages).
> 
> Quite easy:
> 
> (with-open-file (file "file-path"
>                  :external-format #+clisp (ext:make-encoding 
>                                              :charset charset
>                                              :line-terminator line-terminator)
>                                   #-clisp (error "I don't know how to specify
>                                                   the encoding on ~S"
>                                                   (lisp-implementation-type)))
>    (read-line file))
> 
> charset can be one of the following from the CHARSET package:
[...]
> 
> for example CHARSET:MAC-ROMAN may be good for Macintosh files.
> 
> line-terminator can be one of:   :UNIX  :MAC  :DOS

But I may have any number of files to batch process that were created on
any of these 3 machines.  IOW, I don't know ahead of time whether a file
is :UNIX, :MAC, or :DOS.

> 
> All this is of course documented in the implementation notes:
> http://clisp.cons.org/impnotes.html

Thanks, I'll check it out.
From: Pascal Bourguignon
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <87accfbcs5.fsf@thalassa.informatimago.com>
····@stablecross.com (Bob Felts) writes:
> But I may have any number of files to batch process that were created on
> any of these 3 machines.  IOW, I don't know ahead of time whether a file
> is :UNIX, :MAC, or :DOS.

Then you'll have to read the BYTES in these files first:

(with-open-file (file "file-path" :element-type '(unsigned-byte 8))
    ...)

and try to determine the encoding used, using heuristics.

For the line terminator, if every occurrence of CR is followed by an LF,
then you can deduce it's an MS-DOS file.  If there are only CR and no
LF, then it's a Macintosh file. If there are only LF and no CR, then
it's a Unix file.  Otherwise, you'll have to fall back to some
heuristic too.
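
The line-terminator check described above could be sketched roughly as
follows. This is only an illustration under stated assumptions:
GUESS-LINE-TERMINATOR is a hypothetical name, the file is assumed to use
an ASCII-compatible single-byte encoding (so CR is byte 13 and LF is byte
10), and the sketch is slightly stricter than the text in that any mixture
of styles is reported as :MIXED and left to the caller.

```lisp
(defun guess-line-terminator (path)
  "Return :DOS, :MAC, :UNIX, :NONE or :MIXED for the file at PATH."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((crlf 0) (cr 0) (lf 0) (prev nil))
      (loop for byte = (read-byte in nil nil)
            while byte
            do (cond ((= byte 10)
                      (if (eql prev 13)
                          ;; This LF completes a CR LF pair, so the CR
                          ;; counted on the previous byte was not a lone CR.
                          (progn (decf cr) (incf crlf))
                          (incf lf)))
                     ((= byte 13) (incf cr)))
               (setf prev byte))
      (cond ((and (zerop crlf) (zerop cr) (zerop lf)) :none)
            ((and (plusp crlf) (zerop cr) (zerop lf)) :dos)
            ((and (plusp cr) (zerop crlf) (zerop lf)) :mac)
            ((and (plusp lf) (zerop crlf) (zerop cr)) :unix)
            (t :mixed)))))
```

The result could then be used to pick the line terminator when reopening
the file as text; for :MIXED some further heuristic is still needed.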

For the encoding, you could try each encoding one after the other, and
select one of the encodings that gives no error, and for which all the
words spell check correctly (or at least, all the words of which some
characters would be different with a different encoding).  Of course,
for this you need to identify the language in which these words are
written.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

CAUTION: The mass of this product contains the energy equivalent of
85 million tons of TNT per net ounce of weight.
From: Bob Felts
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <1hbbhdc.18nrrrypw0prsN%wrf3@stablecross.com>
Pascal Bourguignon <······@informatimago.com> wrote:

> ····@stablecross.com (Bob Felts) writes:
> > But I may have any number of files to batch process that were created on
> > any of these 3 machines.  IOW, I don't know ahead of time whether a file
> > is :UNIX, :MAC, or :DOS.
> 
> Then you'll have to read the BYTES in these files first:
> 
> (with-open-file (file "file-path" :element-type '(unsigned-byte 8))
>     ...)
> 
> and try to determine the encoding used, using heuristics.
> 
> For the line terminator, if every occurrence of CR is followed by an LF,
> then you can deduce it's an MS-DOS file.  If there are only CR and no
> LF, then it's a Macintosh file. If there are only LF and no CR, then
> it's a Unix file.  Otherwise, you'll have to fall back to some
> heuristic too.
> 
> For the encoding, you could try each encoding one after the other, and
> select one of the encodings that gives no error, and for which all the
> words spell check correctly (or at least, all the words of which some
> characters would be different with a different encoding).  Of course,
> for this you need to identify the language in which these words are
> written.

I think I'm just going to use  WITH-OUTPUT-TO-STRING and READ-CHAR and
roll my own. 
From: Thomas A. Russ
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <ymizmkcado7.fsf@sevak.isi.edu>
····@stablecross.com (Bob Felts) writes:
> 
> I think I'm just going to use  WITH-OUTPUT-TO-STRING and READ-CHAR and
> roll my own. 

I can save you the trouble:


(in-package "CL-USER")

(defun read-any-line (stream &optional eof-error-p eof-value)
  "Simple cross-platform line reader.  For non-interactive, non-network
streams only.  Slow."

  ;; Handle EOF issues upfront:
  (if eof-error-p
    (peek-char nil stream t)
    (unless (peek-char nil stream nil nil)
      (return-from read-any-line eof-value)))

  ;; Collect string output.
  (with-output-to-string (buffer)
    (loop for ch = (read-char stream nil nil)
          until (null ch)
          do (case ch
               (#\Return
                (when (eql (peek-char nil stream nil nil) #\Linefeed)
                  (read-char stream nil nil))
                (loop-finish))
               (#\Linefeed
                (loop-finish))
               (t (write-char ch buffer))))))
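
A small helper for trying a line reader such as the one above on a string
with mixed line endings. This is only a sketch: SPLIT-LINES-WITH is a
hypothetical name, and it assumes the reader accepts the arguments
(stream eof-error-p eof-value), as both READ-LINE and READ-ANY-LINE do.

```lisp
(defun split-lines-with (line-reader text)
  "Return the lines of TEXT as a list, reading each with LINE-READER.
LINE-READER is called as (line-reader stream nil eof-value)."
  (with-input-from-string (in text)
    (loop with eof = (list :eof)        ; fresh object as end-of-file marker
          for line = (funcall line-reader in nil eof)
          until (eq line eof)
          collect line)))
```

With READ-ANY-LINE loaded,

  (split-lines-with #'read-any-line
                    (format nil "dos~C~Cmac~Cunix~Cend"
                            #\Return #\Linefeed #\Return #\Linefeed))

should return ("dos" "mac" "unix" "end").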
          


-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Bob Felts
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <1hbfmy8.2m7cwq6ambfkN%wrf3@stablecross.com>
Thomas A. Russ <···@sevak.isi.edu> wrote:

> ····@stablecross.com (Bob Felts) writes:
> > 
> > I think I'm just going to use  WITH-OUTPUT-TO-STRING and READ-CHAR and
> > roll my own. 
> 
> I can save you the trouble:

Thanks!
From: Joerg Hoehle
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <u64myj8oc.fsf@users.sourceforge.net>
Pascal Bourguignon <······@informatimago.com> writes:
> ····@stablecross.com (Bob Felts) writes:
> > Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
> > and Classic Mac \r line endings?

>                  :external-format #+clisp (ext:make-encoding 
> line-terminator can be one of:   :UNIX  :MAC  :DOS

> All this is of course documented in the implementation notes:
> http://clisp.cons.org/impnotes.html

Pascal, you provide a clisp-specific solution, mention the impnotes
and fail to explain to the OP that CLISP does not care on input about
whether \r, \r\n or \n is being used??

That also is documented in the impnotes!

The OP can provide any line-terminator he likes, it will be ignored
right away for :INPUT.  CLISP will automatically read all three
formats (and likely even messed up mixed formats within a single file,
as I've seen happen when text files are edited here and there, but
also in responses from HTTP servers -- did you consider that?).

As a result,
(open
  :direction :input
  :external-format charset:xyz
  ...
is enough. No need to use
(ext:make-encoding :charset charset:xyz :line-terminator :zyx)

But the line-terminator is important for output. HTTP mandates :DOS
(CRLF) style in the headers!  (Some HTTP servers will transform CGI
header output but don't count on that).
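
For output, one portable way to get CRLF regardless of the platform
default is to write the two characters explicitly. A minimal sketch;
WRITE-CRLF-LINE is a hypothetical helper, and it assumes the stream does
not itself translate #\Linefeed on output (on a stream opened with a
:DOS line terminator the result would presumably be CR CR LF).

```lisp
(defun write-crlf-line (string &optional (stream *standard-output*))
  "Write STRING to STREAM followed by an explicit CR LF pair."
  (write-string string stream)
  (write-char #\Return stream)
  (write-char #\Linefeed stream)
  string)
```

For example, (write-crlf-line "Content-Type: text/plain" stream) emits one
header line terminated the way HTTP expects.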

Regards,
	Jörg Höhle
Telekom/T-Systems Technology Center
From: Pascal Bourguignon
Subject: Re: Cross-platform read-line question
Date: 
Message-ID: <87d5h60wl7.fsf@thalassa.informatimago.com>
Joerg Hoehle <······@users.sourceforge.net> writes:

> Pascal Bourguignon <······@informatimago.com> writes:
>> ····@stablecross.com (Bob Felts) writes:
>> > Is there any way to get read-line to recognize the DOS \r\n, Unix \n,
>> > and Classic Mac \r line endings?
>
>>                  :external-format #+clisp (ext:make-encoding 
>> line-terminator can be one of:   :UNIX  :MAC  :DOS
>
>> All this is of course documented in the implementation notes:
>> http://clisp.cons.org/impnotes.html
>
> Pascal, you provide a clisp-specific solution, mention the impnotes
> and fail to explain to the OP that CLISP does not care on input about
> whether \r, \r\n or \n is being used??
>
> That also is documented in the impnotes!
> [...]

Sorry.  I'll take note.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

NEW GRAND UNIFIED THEORY DISCLAIMER: The manufacturer may
technically be entitled to claim that this product is
ten-dimensional. However, the consumer is reminded that this
confers no legal rights above and beyond those applicable to
three-dimensional objects, since the seven new dimensions are
"rolled up" into such a small "area" that they cannot be
detected.