clisp's read-char can't read unicode?

From: Wang Jiaji
Subject: clisp's read-char can't read unicode?
Date: Fri, 01 May 2009 15:24:00 +0000
Message-ID: <086b49ac-41be-4a99-966d-859dc194dc9a@s1g2000prd.googlegroups.com>

I'm using clisp on Windows. While parsing an English text file
downloaded from the Internet, I got an "invalid multibyte or wide
character" error, caused by read-char. I then opened that file and
saved it anew, and the problem disappeared. I tried a lot of files;
they all contain some kind of "invalid multibyte character" and can be
corrected by re-saving them. Is that supposed to happen?

BTW, I don't really know if they're in unicode or not, but I guess
this could be the only reason.

From: Pascal J. Bourguignon
Subject: Re: clisp's read-char can't read unicode?
Date: Fri, 01 May 2009 16:22:12 +0000
Message-ID: <87zldw7pzf.fsf@galatea.local>

Wang Jiaji <··········@gmail.com> writes:

> I'm using clisp on Windows. While parsing an English text file
> downloaded from the Internet, I got an "invalid multibyte or wide
> character" error, caused by read-char. I then opened that file and
> saved it anew, and the problem disappeared. I tried a lot of files;
> they all contain some kind of "invalid multibyte character" and can be
> corrected by re-saving them. Is that supposed to happen?

You have to specify the encoding of your streams.

This may be done with one of the -E options,

or with one of the following "variables" (some are symbol-macros
expanding to accessors, so beware: use EXT:LETF instead of LET):
*DEFAULT-FILE-ENCODING* *MISC-ENCODING* *PATHNAME-ENCODING*
*TERMINAL-ENCODING* *HTTP-ENCODING*

or by specifying the :external-format when you open the stream,
passing an encoding found in the CHARSET package or one you make with
EXT:MAKE-ENCODING.


> BTW, I don't really know if they're in unicode or not, but I guess
> this could be the only reason.

Of course, the first thing to know is what encoding your data is
stored in.


Try: (with-open-file (inp file :external-format charset:utf-8)      (read-line inp))
or   (with-open-file (inp file :external-format charset:iso-8859-1) (read-line inp))
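
The other two routes mentioned above (rebinding the default encoding,
and building an encoding with EXT:MAKE-ENCODING) could look roughly like
this. It is only a sketch and it is CLISP-specific; the function names
and the PATH argument are made up for illustration, and the CUSTOM:
package prefix and the #\? replacement character are assumptions on top
of what is said above.

```lisp
;; CLISP only.  FIRST-LINE-WITH-DEFAULT and FIRST-LINE-LENIENT are
;; hypothetical helper names, not part of CLISP.

(defun first-line-with-default (path)
  "Read the first line of PATH with UTF-8 as the default file encoding."
  ;; *DEFAULT-FILE-ENCODING* is a place, not a special variable,
  ;; so it is rebound with EXT:LETF rather than LET.
  (ext:letf ((custom:*default-file-encoding* charset:utf-8))
    (with-open-file (inp path)
      (read-line inp))))

(defun first-line-lenient (path)
  "Read the first line of PATH as UTF-8, replacing invalid input by #\\?."
  (with-open-file (inp path
                       :external-format
                       (ext:make-encoding :charset charset:utf-8
                                          :input-error-action #\?))
    (read-line inp)))
```

The lenient variant avoids the "invalid multibyte or wide character"
error, at the price of silently hiding the fact that the guess was wrong.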

-- 
__Pascal Bourguignon__

From: zunumi
Subject: Re: clisp's read-char can't read unicode?
Date: Sat, 02 May 2009 14:22:27 +0000
Message-ID: <9bb7c270-887a-4853-9b74-6179d3f08d6e@w31g2000prd.googlegroups.com>

On May 2, 12:22 am, ····@informatimago.com (Pascal J. Bourguignon)
wrote:
> Try: (with-open-file (inp file :external-format charset:utf-8)      (read-line inp))
> or   (with-open-file (inp file :external-format charset:iso-8859-1) (read-line inp))

Thanks, I didn't know that with-open-file has such an option.
Is there a way to detect the encoding of a file?


From: Pascal J. Bourguignon
Subject: Re: clisp's read-char can't read unicode?
Date: Sat, 02 May 2009 17:12:09 +0000
Message-ID: <87bpqb77km.fsf@galatea.local>

zunumi <··········@gmail.com> writes:

> Thanks, I didn't know that with-open-file has such an option.
> Is there a way to detect the encoding of a file?

There is no algorithm.
There are heuristics.

First, a file containing the bytes 106 107 108 109 110 111
could represent the characters "¦,%_>?" if it was encoded in EBCDIC,
or the characters "jklmno" if it was encoded in ASCII.

Even if you know which characters are stored in the file, you cannot in
general determine the encoding.  If you know that a file containing
the bytes 106 107 108 109 110 111 actually represents the characters
"jklmno", then you cannot say whether it is encoded in ASCII, in
ISO-8859-1, in ISO-8859-2, ... in ISO-8859-15, or in UTF-8, etc.
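
The reason is that bytes below 128 mean the same thing in ASCII, in
every ISO-8859-n, and in UTF-8. A quick check for that situation could
look like this (portable Common Lisp; ASCII-ONLY-P is a name made up
for this sketch):

```lisp
(defun ascii-only-p (bytes)
  "True when every element of BYTES is below 128."
  (every (lambda (byte) (< byte 128)) bytes))

;; Only bytes below 128: the content gives no hint which of the
;; ASCII-compatible encodings was used.
(ascii-only-p #(106 107 108 109 110 111))   ; => T

;; Byte 241 is above 127, so here the choice of encoding matters.
(ascii-only-p #(76 97 32 112 105 241 97))   ; => NIL
```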


Where two candidate encodings differ, a heuristic can guess.  If the
file contains the bytes:
    #(76 97 32 112 105 241 97 116 97 32 99 117 101 115 116 97 32 49 48 32 164 46)
    iso-8859-1  -> "La piñata cuesta 10 ¤."
    iso-8859-15 -> "La piñata cuesta 10 €."
then between the iso-8859-1 and iso-8859-15 encodings, you could prefer
the latter.  (But the file could predate the euro, and the price
could indeed have been 10 pesetas (or 10 pesos or whatever; ¤ is a
placeholder for whatever currency unit).)


But if the file contains the bytes:
    #(76 101 32 106 97 109 98 111 110 32 233 116 97 105 116 32 98 111 110 46)
    iso-8859-1  -> "Le jambon était bon."
    iso-8859-15 -> "Le jambon était bon."
You cannot choose between these encodings (not that it matters much if
the file contains only these characters).
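
To make the two examples concrete, here is a rough sketch of both
decodings written by hand. It assumes an implementation where CODE-CHAR
takes Unicode code points (as CLISP does); the names
*LATIN-9-OVERRIDES*, DECODE-LATIN-1, DECODE-LATIN-9 and AMBIGUOUS-BYTES
are invented for the illustration. ISO-8859-15 differs from ISO-8859-1
in eight byte values only, so a file containing none of them reads the
same either way.

```lisp
;; The eight bytes that ISO-8859-15 assigns differently from ISO-8859-1,
;; each paired with its Unicode code point in ISO-8859-15.
(defparameter *latin-9-overrides*
  '((#xA4 . #x20AC)    ; euro sign (currency sign in ISO-8859-1)
    (#xA6 . #x0160)    ; capital S with caron
    (#xA8 . #x0161)    ; small s with caron
    (#xB4 . #x017D)    ; capital Z with caron
    (#xB8 . #x017E)    ; small z with caron
    (#xBC . #x0152)    ; capital OE ligature
    (#xBD . #x0153)    ; small oe ligature
    (#xBE . #x0178)))  ; capital Y with diaeresis

(defun decode-latin-1 (bytes)
  "ISO-8859-1: each byte is the Unicode code point of the same value."
  (map 'string #'code-char bytes))

(defun decode-latin-9 (bytes)
  "ISO-8859-15: like ISO-8859-1 except for the eight overrides."
  (map 'string
       (lambda (byte)
         (code-char (or (cdr (assoc byte *latin-9-overrides*)) byte)))
       bytes))

(defun ambiguous-bytes (bytes)
  "Return the elements of BYTES that the two encodings decode differently."
  (remove-if-not (lambda (byte) (assoc byte *latin-9-overrides*)) bytes))

;; The pinata example contains byte 164, so the two readings differ:
(ambiguous-bytes #(76 97 32 112 105 241 97 116 97 32 99 117 101 115
                   116 97 32 49 48 32 164 46))            ; => #(164)

;; The jambon example contains none of the eight bytes:
(ambiguous-bytes #(76 101 32 106 97 109 98 111 110 32 233 116 97 105
                   116 32 98 111 110 46))                 ; => #()
```

An empty result only says that these two encodings agree on the file;
it does not tell you that either of them is the right one.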


-- 
__Pascal Bourguignon__