Reading Unicode - invalid byte

From: Niklas Kambeitz
Subject: Reading Unicode - invalid byte
Date: Wed, 16 Feb 2005 18:16:06 +0000
Message-ID: <G5MQd.23222$6f.11700@charlie.risq.qc.ca>

How do I read files with weird characters in them?

Using clisp 2.32 with XEmacs on Windows XP, I'm trying to read a file
into a string using this:

(with-open-file (stream "sample.file")
   (let ((result (make-array (file-length stream))))
     (read-sequence result stream)
     (setf *file-string* result)))

This works unless there are weird characters in the file such as
#\LATIN_CAPITAL_LETTER_N_WITH_TILDE. Then I get:

*** - invalid byte #x81 in CHARSET:CP1252 conversion

I gather this has something to do with Unicode character conversion.
Looking in the CLISP documentation, I find that I need to have Unicode
enabled. However, I think I do, since char-code-limit is 1114112.

Any help appreciated.

Niklas Kambeitz

From: Sam Steingold
Subject: Re: Reading Unicode - invalid byte
Date: Wed, 16 Feb 2005 18:34:00 +0000
Message-ID: <u3bvwuzfr.fsf@gnu.org>

> * Niklas Kambeitz <······@cb-obk.zptvyy.pn> [2005-02-16 18:16:06 +0000]:
>
> How do I read files with weird characters in them?
>
> Using clisp 2.32 with XEmacs on Windows XP, I'm trying to read a file
> into a string using this:
>
> (with-open-file (stream "sample.file")
>    (let ((result (make-array (file-length stream))))
>      (read-sequence result stream)
>      (setf *file-string* result)))
>
> This works unless there are weird characters in the file such as
> #\LATIN_CAPITAL_LETTER_N_WITH_TILDE. Then I get:
>
> *** - invalid byte #x81 in CHARSET:CP1252 conversion

clisp home -> FAQ -> trouble -> invalid byte ==>
<http://clisp.cons.org/faq.html#enc-err> ==>
<http://clisp.cons.org/clisp.html#opt-enc> ==>
<http://clisp.cons.org/impnotes.html#def-file-enc>
<http://clisp.cons.org/impnotes.html#extfmt>

in short, you should either use a 1:1 encoding,
e.g., charset:iso-8859-1,
or the specific encoding in which your file has been written.

  (with-open-file (stream "sample.file" :external-format charset:iso-8859-1)
    (let ((result (make-array (file-length stream))))
      (read-sequence result stream)
      (setf *file-string* result)))
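
A minimal sketch of a variant that returns an actual string, assuming CLISP (charset:iso-8859-1 is CLISP-specific). make-array without :element-type creates a general vector rather than a string, and file-length may exceed the number of characters actually read (for example when line terminators are converted), so this version allocates with make-string and trims to the index that read-sequence returns. The name read-file-into-string is hypothetical, not something from CLISP or the FAQ.

```lisp
;; Hypothetical helper: read a whole file into a string using a 1:1
;; encoding, so that no byte value can be rejected as invalid.
;; Assumes CLISP, where charset:iso-8859-1 names the Latin-1 encoding.
(defun read-file-into-string (path)
  (with-open-file (stream path :external-format charset:iso-8859-1)
    (let* ((buffer (make-string (file-length stream)))
           ;; read-sequence returns the index of the first element
           ;; of buffer that was not filled in.
           (end (read-sequence buffer stream)))
      ;; Drop any unused tail of the buffer.
      (subseq buffer 0 end))))
```

Usage would then be (setf *file-string* (read-file-into-string "sample.file")). If the file's real encoding is known, passing that encoding instead of ISO-8859-1 should give the intended characters rather than a byte-for-byte mapping.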

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Between grand theft and a legal fee, there only stands a law degree.

From: Niklas Kambeitz
Subject: Re: Reading Unicode - invalid byte
Date: Wed, 16 Feb 2005 18:58:57 +0000
Message-ID: <RJMQd.23223$6f.16982@charlie.risq.qc.ca>

Sam Steingold wrote:

>>*** - invalid byte #x81 in CHARSET:CP1252 conversion

> in short, you should either use a 1:1 encoding,
> e.g., charset:iso-8859-1,
> or the specific encoding in which your file has been written.
> 
>   (with-open-file (stream "sample.file" :external-format charset:iso-8859-1)
>     (let ((result (make-array (file-length stream))))
>       (read-sequence result stream)
>       (setf *file-string* result)))

This worked like a charm. Thanks!

From: Philippe Brochard
Subject: Re: Reading Unicode - invalid byte
Date: Wed, 16 Feb 2005 18:23:12 +0000
Message-ID: <87r7jgmkj3.fsf@grigri.elcforest>

Niklas Kambeitz writes:

> How do I read files with weird characters in them?
>
> Using clisp 2.32 with XEmacs on Windows XP, I'm trying to read a file
> into a string using this:
>
> (with-open-file (stream "sample.file")
>    (let ((result (make-array (file-length stream))))
>      (read-sequence result stream)
>      (setf *file-string* result)))
>
> This works unless there are weird characters in the file such as
> #\LATIN_CAPITAL_LETTER_N_WITH_TILDE. Then I get:
>
> *** - invalid byte #x81 in CHARSET:CP1252 conversion
>
> I gather this has something to do with Unicode character conversion.
> Looking in the CLISP documentation, I find that I need to have Unicode
> enabled. However, I think I do, since char-code-limit is 1114112.
>
> Any help appreciated.
>
Maybe you can have a look at:

      http://clisp.cons.org/faq.html#enc-err

-- 
Philippe Brochard    <···········@SPAM_free.fr>
                      http://hocwp.free.fr