From: ············@gmail.com
Subject: File IO with unicode
Date: 
Message-ID: <1157082649.608074.301870@i42g2000cwa.googlegroups.com>
Hi :

With the simple file text IO as follows:

(with-open-file (stream "/some/file/name.txt")
  (format t "~a~%" (read-line stream)))

I tried two text files, both are Traditional Chinese,
one is Big-5(Codepage 950), the other is UTF-8

[1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
  (format t "~a~%" (read-line stream)))
中文
NIL

This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
2.39),
it shows

Character #\u4E2D cannot be represented in the character set
CHARSET:ISO-8859-1
   [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]


[2]> (with-open-file (stream "/temp/UTF8_Chinese.txt")
  (format t "~a~%" (read-line stream)))

*** - POSIX library error 42 (EILSEQ): Invalid multibyte or wide
character
The following restarts are available:
ABORT          :R1      ABORT
Break 1 [3]> :R1

From: Raffael Cavallaro
Subject: Re: File IO with unicode
Date: 
Message-ID: <2006090100395811272-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2006-08-31 23:50:49 -0400, ·············@gmail.com" 
<············@gmail.com> said:

> With the simple file text IO as follows:
> 
> (with-open-file (stream "/some/file/name.txt")
>   (format t "~a~%" (read-line stream)))
> 
> I tried two text files, both are Traditional Chinese,
> one is Big-5(Codepage 950), the other is UTF-8

Maybe you need the :external-format keyword option to with-open-file?
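
Something along these lines, as a rough sketch rather than a tested recipe.
READ-FIRST-LINE and *UTF-8* are names I made up for illustration;
CHARSET:UTF-8 is the CLISP spelling, and :UTF-8 is assumed to be what other
implementations (SBCL, for instance) accept:

```lisp
;; Sketch: name the file's encoding explicitly instead of relying on
;; whatever default the Lisp image happens to use.
(defparameter *utf-8*
  ;; CLISP exposes encodings as variables in the CHARSET package;
  ;; the keyword form is an assumption for other implementations.
  #+clisp charset:utf-8
  #-clisp :utf-8)

(defun read-first-line (path external-format)
  "Return the first line of PATH, decoded with EXTERNAL-FORMAT."
  (with-open-file (stream path :external-format external-format)
    (read-line stream)))
```

You would then call (read-first-line "/temp/UTF8_Chinese.txt" *utf-8*).
For the Big5 file, CLISP would presumably want charset:big5 instead, if your
build provides that charset.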
From: Sam Steingold
Subject: Re: File IO with unicode
Date: 
Message-ID: <m3y7t4b3ke.fsf@loiso.podval.org>
> * ············@gmail.com <············@tznvy.pbz> [2006-08-31 20:50:49 -0700]:
>
> Character #\u4E2D cannot be represented in the character set
> CHARSET:ISO-8859-1
>    [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]

http://clisp.cons.org/impnotes/faq.html#faq-enc-err

-- 
Sam Steingold (http://www.podval.org/~sds) on Fedora Core release 5 (Bordeaux)
http://camera.org http://thereligionofpeace.com http://memri.org
http://honestreporting.com http://jihadwatch.org http://mideasttruth.com
Marriage is the sole cause of divorce.
From: ············@gmail.com
Subject: Re: File IO with unicode
Date: 
Message-ID: <1157089534.957174.62310@i3g2000cwc.googlegroups.com>
I expect the file IO library would detect the BOM for the encoding of
text file.
http://en.wikipedia.org/wiki/Byte_Order_Mark
From: kavenchuk
Subject: Re: File IO with unicode
Date: 
Message-ID: <1157093038.370100.274540@m79g2000cwm.googlegroups.com>
············@gmail.com wrote:

> http://en.wikipedia.org/wiki/Byte_Order_Mark

You read it?

"... Quite a lot of Windows software (including Windows Notepad) adds
one to UTF-8 files. However in Unix-like systems (which make heavy use
of text files for configuration) this practice is not recommended, as
it will interfere with correct processing of important codes such as
the hash-bang at the start of an interpreted script."

WBR, Yaroslav Kavenchuk.
From: Pascal Bourguignon
Subject: Re: File IO with unicode
Date: 
Message-ID: <874pvst8na.fsf@thalassa.informatimago.com>
·············@gmail.com" <············@gmail.com> writes:

> I expect the file IO library would detect the BOM for the encoding of
> text file.
> http://en.wikipedia.org/wiki/Byte_Order_Mark

It works only for Unicode files.  

What about ISO-8859-1 files?  What about ISO-2022-JP files?  What
about BIG5 files?  What about US-ASCII files?
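
To make that concrete, here is roughly what BOM sniffing amounts to, in
portable Common Lisp.  This is only a sketch; SNIFF-BOM is a hypothetical
helper, not something CLISP or the standard provides.  Note that for a Big5
or ISO-8859-1 file it can only answer NIL, which tells you nothing about
which encoding to use:

```lisp
(defun sniff-bom (path)
  "Return :UTF-8, :UTF-16BE or :UTF-16LE if PATH starts with that BOM, else NIL."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let* ((buf (make-array 3 :element-type '(unsigned-byte 8)
                              :initial-element 0))
           ;; N is the number of bytes actually read (fewer than 3 for tiny files).
           (n (read-sequence buf in)))
      (cond ((and (>= n 3)
                  (= (aref buf 0) #xEF) (= (aref buf 1) #xBB) (= (aref buf 2) #xBF))
             :utf-8)
            ((and (>= n 2) (= (aref buf 0) #xFE) (= (aref buf 1) #xFF))
             :utf-16be)
            ;; FF FE is also how a UTF-32LE BOM begins; this sketch ignores that.
            ((and (>= n 2) (= (aref buf 0) #xFF) (= (aref buf 1) #xFE))
             :utf-16le)
            (t nil)))))
```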

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
The rule for today:
Touch my tail, I shred your hand.
New rule tomorrow.
From: Pascal Bourguignon
Subject: Re: File IO with unicode
Date: 
Message-ID: <87bqq0tdpe.fsf@thalassa.informatimago.com>
·············@gmail.com" <············@gmail.com> writes:

> Hi :
>
> With the simple file text IO as follows:
>
> (with-open-file (stream "/some/file/name.txt")
>   (format t "~a~%" (read-line stream)))
>
> I tried two text files, both are Traditional Chinese,
> one is Big-5(Codepage 950), the other is UTF-8
>
> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
>   (format t "~a~%" (read-line stream)))
> 中文
> NIL
>
> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
> 2.39),
> it shows
>
> Character #\u4E2D cannot be represented in the character set
> CHARSET:ISO-8859-1
>    [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]
>
>
> [2]> (with-open-file (stream "/temp/UTF8_Chinese.txt")
>   (format t "~a~%" (read-line stream)))
>
> *** - POSIX library error 42 (EILSEQ): Invalid multibyte or wide
> character
> The following restarts are available:
> ABORT          :R1      ABORT
> Break 1 [3]> :R1

The Common Lisp standard specifies the standard character set to be exactly:


SP   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  
 0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?  
 @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  
 P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _  
 `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  
 p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~      

Those 95 graphic characters plus #\Newline.  Nothing less, nothing more.

So why are you expecting to be able to read a file containing any
character other than these, with only the standard API?



Now, if you read the error message more closely, you might notice
something.  Try reading it again:


   Character #\u4E2D cannot be represented in the character set
   CHARSET:ISO-8859-1


What does this error message tell us?


You may also want to reread the CLHS page about OPEN:

   http://www.lispworks.com/documentation/HyperSpec/Body/f_open.htm

and the CLISP Implementation Notes

   http://clisp.cons.org/impnotes/stream-dict.html#open

(only for a start, don't hesitate to further follow links, like:

   http://clisp.cons.org/impnotes/encoding.html#def-file-enc
).
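
As a hint of where those links lead: CLISP keeps its default file encoding
in CUSTOM:*DEFAULT-FILE-ENCODING*, so something like the following sketch
changes what OPEN and WITH-OPEN-FILE assume when no :external-format is
given.  Treat it as an illustration under the assumption that your files
are UTF-8, not as the one right setting:

```lisp
;; CLISP only: the CUSTOM and CHARSET packages do not exist elsewhere.
#+clisp
(setf custom:*default-file-encoding* charset:utf-8)
```

This only affects how files are decoded; it does not change the encoding
used to talk to the terminal or to Emacs.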


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

Nobody can fix the economy.  Nobody can be trusted with their finger
on the button.  Nobody's perfect.  VOTE FOR NOBODY.
From: Stephen Compall
Subject: Re: File IO with unicode
Date: 
Message-ID: <3IQJg.12001$Df2.9693@fe05.news.easynews.com>
············@gmail.com wrote:
> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
>   (format t "~a~%" (read-line stream)))
> 中文
> NIL
> 
> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
> 2.39),
> it shows
> 
> Character #\u4E2D cannot be represented in the character set
> CHARSET:ISO-8859-1
>    [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]

I would guess that this relates to the coding system for communication
between Emacs and CLISP, if you are saying this works with plain CLISP
but not when connecting with SLIME.

In CLISP, after loading Swank but before starting the server, do:

(setq swank::*coding-system* :utf-8-unix)

In Emacs, after loading SLIME but before connecting to CLISP, do:

(setq slime-net-coding-system 'utf-8-unix)

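I forget how to fix the inferior-lisp buffer to do this right, maybe
something like C-x <RET> p (set-buffer-process-coding-system)?  C-x <RET> f
sets the coding system of a file buffer, not of a process.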

-- 
Stephen Compall
http://scompall.nocandysw.com/blog
From: Timofei Shatrov
Subject: Re: File IO with unicode
Date: 
Message-ID: <44f7e509.10090208@news.readfreenews.net>
On Fri, 01 Sep 2006 06:47:27 GMT, Stephen Compall
<···············@gmail.com> tried to confuse everyone with this message:

>············@gmail.com wrote:
>> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
>>   (format t "~a~%" (read-line stream)))
>> 中文
>> NIL
>> 
>> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
>> 2.39),
>> it shows
>> 
>> Character #\u4E2D cannot be represented in the character set
>> CHARSET:ISO-8859-1
>>    [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]
>
>I would guess that this relates to the coding system for communication
>between Emacs and CLISP, if you are saying this works with plain CLISP
> but not when connecting with SLIME.
>
>In CLISP, after loading Swank but before starting the server, do:
>
>(setq swank::*coding-system* :utf-8-unix)

I don't think it is necessary, because the next step sets it up already:

>In Emacs, after loading SLIME but before connecting to CLISP, do:
>
>(setq slime-net-coding-system 'utf-8-unix)

Put this line into .emacs

-- 
|Don't believe this - you're not worthless              ,gr---------.ru
|It's us against millions and we can't take them all... |  ue     il   |
|But we can take them on!                               |     @ma      |
|                       (A Wilhelm Scream - The Rip)    |______________|
From: Christopher Brown
Subject: Re: File IO with unicode
Date: 
Message-ID: <1157124091.840616.198210@74g2000cwt.googlegroups.com>
For what it's worth, this also fixed my problem (earlier thread about
SBCL & non-ASCII filenames).
After I applied Yaroslav Kavenchuk's patches to SBCL, I found SLIME
would hang on directory listings.  Changing the external-format as
below fixed that problem.

Cheers,
Chris

Timofei Shatrov wrote:
> On Fri, 01 Sep 2006 06:47:27 GMT, Stephen Compall
> <···············@gmail.com> tried to confuse everyone with this message:
>
> >············@gmail.com wrote:
> >> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
> >>   (format t "~a~%" (read-line stream)))
> >> 中文
> >> NIL
> >>
> >> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
> >> 2.39),
> >> it shows
> >>
> >> Character #\u4E2D cannot be represented in the character set
> >> CHARSET:ISO-8859-1
> >>    [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]
> >
> >I would guess that this relates to the coding system for communication
> >between Emacs and CLISP, if you are saying this works with plain CLISP
> > but not when connecting with SLIME.
> >
> >In CLISP, after loading Swank but before starting the server, do:
> >
> >(setq swank::*coding-system* :utf-8-unix)
>
> I don't think it is necessary, because the next step sets it up already:
>
> >In Emacs, after loading SLIME but before connecting to CLISP, do:
> >
> >(setq slime-net-coding-system 'utf-8-unix)
>
> Put this line into .emacs