Unicode file in CLISP

From: proton
Subject: Unicode file in CLISP
Date: Sun, 27 Jan 2008 19:23:14 +0000
Message-ID: <d3809342-a501-4191-94a6-a707ceb7384b@e23g2000prf.googlegroups.com>

Please, all the experts out there, can anyone give me an answer
understandable for a newbie who's getting really frustrated with
something apparently really simple:

I am trying to load a file in CLISP which contains some Unicode
characters. CLISP chokes on some of these characters and gives the
error:
*** - invalid byte #x8d in CHARSET:CP1252 conversion

If I change the file encoding:
(setf *default-file-encoding* (make-encoding :charset charset:utf-8))
it still chokes, but the error message changes slightly:
*** - invalid byte #xFA in CHARSET:UTF-8 conversion, not a Unicode-16

I've tried also with the *terminal-encoding*, and still the same.
The file is UTF-8, of this I'm sure.

Does anyone know what to do to load this file, and explain it with
step-by-step commands, so that I can follow? I've read the CLISP
implementation notes, but there are no examples there, and I get
immediately lost.
Thanks a lot.

From: Pascal Bourguignon
Subject: Re: Unicode file in CLISP
Date: Sun, 27 Jan 2008 19:30:30 +0000
Message-ID: <87ir1frrqh.fsf@thalassa.informatimago.com>

proton <··········@gmail.com> writes:

> Please, all the experts out there, can anyone give me an answer
> understandable for a newbie who's getting really frustrated with
> something apparently really simple:
>
> I am trying to load a file in CLISP which contains some Unicode
> characters. CLISP chokes on some of these characters and gives the
> error:
> *** - invalid byte #x8d in CHARSET:CP1252 conversion
>
> If I change the file encoding:
> (setf *default-file-encoding* (make-encoding :charset charset:utf-8))
> it still chokes, but the error message changes slightly:
> *** - invalid byte #xFA in CHARSET:UTF-8 conversion, not a Unicode-16
>
> I've tried also with the *terminal-encoding*, and still the same.
> The file is UTF-8, of this I'm sure.

But you're wrong.  
#xFA is not a code you can find in a UTF-8 byte sequence.
http://en.wikipedia.org/wiki/Utf-8


> Does anyone know what to do to load this file, and explain it with
> step-by-step commands, so that I can follow? I've read the CLISP
> implementation notes, but there are no examples there, and I get
> immediately lost.

It's hard to tell you what to do, because you're asking something that
has no meaning.  You have a binary file.  What exactly do you want to
do with it?
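
One way to settle what the file really contains is to read it as raw octets and look for byte values that can never occur in UTF-8 (#xC0, #xC1, and #xF5 through #xFF under RFC 3629). A minimal sketch in portable Common Lisp; the function name is made up for illustration, and it only flags impossible bytes rather than fully validating the file:

```lisp
(defun find-non-utf-8-bytes (path)
  "Return a list of (offset . byte) for bytes that can never appear in UTF-8."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for offset from 0
          for byte = (read-byte in nil nil)
          while byte
          ;; #xC0, #xC1 and #xF5..#xFF are not legal anywhere in UTF-8.
          when (or (= byte #xC0) (= byte #xC1) (>= byte #xF5))
            collect (cons offset byte))))
```

An empty result does not prove the file is valid UTF-8, but a non-empty one proves it is not.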


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

This universe shipped by weight, not volume.  Some expansion may have
occurred during shipment.

From: proton
Subject: Re: Unicode file in CLISP
Date: Sun, 27 Jan 2008 19:43:05 +0000
Message-ID: <6b601f8e-9dd7-42d4-a877-8a6a1c6fa3b1@h11g2000prf.googlegroups.com>

On Jan 27, 8:30 pm, Pascal Bourguignon <····@informatimago.com> wrote:
> proton <··········@gmail.com> writes:
> > Please, all the experts out there, can anyone give me an answer
> > understandable for a newbie who's getting really frustrated with
> > something apparently really simple:
>
> > I am trying to load a file in CLISP which contains some Unicode
> > characters. CLISP chokes on some of these characters and gives the
> > error:
> > *** - invalid byte #x8d in CHARSET:CP1252 conversion
>
> > If I change the file encoding:
> > (setf *default-file-encoding* (make-encoding :charset charset:utf-8))
> > it still chokes, but the error message changes slightly:
> > *** - invalid byte #xFA in CHARSET:UTF-8 conversion, not a Unicode-16
>
> > I've tried also with the *terminal-encoding*, and still the same.
> > The file is UTF-8, of this I'm sure.
>
> But you're wrong.
> #xFA is not a code you can find in a UTF-8 byte sequence.
> http://en.wikipedia.org/wiki/Utf-8
>
> > Does anyone know what to do to load this file, and explain it with
> > step-by-step commands, so that I can follow? I've read the CLISP
> > implementation notes, but there are no examples there, and I get
> > immediately lost.
>
> It's hard to tell you what to do, because you're asking something that
> has no meaning.  You have a binary file.  What exactly do you want to
> do with it?
>
> --
> __Pascal Bourguignon__                    http://www.informatimago.com/
>
> This universe shipped by weight, not volume.  Some expansion may have
> occurred during shipment.

No, it's not a binary file. It is a lisp file in UTF-8. I've opened it
with a binary editor and there is no such FA byte in it, it's just the
message CLISP returns.

From: proton
Subject: Re: Unicode file in CLISP
Date: Sun, 27 Jan 2008 20:18:34 +0000
Message-ID: <6db0c0e9-f512-493d-a816-8d53ad5a8868@i7g2000prf.googlegroups.com>

On Jan 27, 8:43 pm, proton <··········@gmail.com> wrote:
> No, it's not a binary file. It is a lisp file in UTF-8. I've opened it
> with a binary editor and there is no such FA byte in it, it's just the
> message CLISP returns.

OK, I solved it. I'm posting the answer here, first so that people will
stop looking for it and, second, so that anyone who runs into the same
problem can find it:
the moment I load a file in UTF-8 all successive files are expected to
be in the same encoding. So if the first file loads another file, the
latter will cause an error if it's not in UTF-8. The error message I
was getting came from the second file, not the first one, therefore
the confusion.
Thanks a lot to anyone who tried to help.

From: Pascal Bourguignon
Subject: Re: Unicode file in CLISP
Date: Mon, 28 Jan 2008 00:19:26 +0000
Message-ID: <87ejc2ssxd.fsf@thalassa.informatimago.com>

proton <··········@gmail.com> writes:

> On Jan 27, 8:43 pm, proton <··········@gmail.com> wrote:
>> On Jan 27, 8:30 pm, Pascal Bourguignon <····@informatimago.com> wrote:
>>
>>
>>
>> > proton <··········@gmail.com> writes:
>> > > Please, all the experts out there, can anyone give me an answer
>> > > understandable for a newbie who's getting really frustrated with
>> > > something apparently really simple:
>>
>> > > I am trying to load a file in CLISP which contains some Unicode
>> > > characters. CLISP chokes on some of these characters and gives the
>> > > error:
>> > > *** - invalid byte #x8d in CHARSET:CP1252 conversion
>>
>> > > If I change the file encoding:
>> > > (setf *default-file-encoding* (make-encoding :charset charset:utf-8))
>> > > it still chokes, but the error message changes slightly:
>> > > *** - invalid byte #xFA in CHARSET:UTF-8 conversion, not a Unicode-16
>>
>> > > I've tried also with the *terminal-encoding*, and still the same.
>> > > The file is UTF-8, of this I'm sure.
>>
>> > But you're wrong.
>> > #xFA is not a code you can find in a UTF-8 byte sequence.
>> > http://en.wikipedia.org/wiki/Utf-8
> [...]
> OK, I solved it. I'm posting the answer here, first so that people will
> stop looking for it and, second, so that anyone who runs into the same
> problem can find it:
> the moment I load a file in UTF-8 all successive files are expected to
> be in the same encoding. So if the first file loads another file, the
> latter will cause an error if it's not in UTF-8. The error message I
> was getting came from the second file, not the first one, therefore
> the confusion.
> Thanks a lot to anyone who tried to help.

When you break in the debugger, you can use *LOAD-PATHNAME*  to know
what file exactly it is reading.
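
The same information can be printed automatically. A small sketch, assuming nothing beyond standard Common Lisp (the wrapper name is hypothetical): the handler runs while the failing LOAD is still active, so *LOAD-PATHNAME* names the innermost file being read, and since the handler returns normally the error still reaches the debugger afterwards.

```lisp
(defun load-reporting-file (path &rest load-args)
  "LOAD PATH, printing the file actually being read if an error is signaled."
  (handler-bind ((error (lambda (condition)
                          ;; Runs before any unwinding, so *LOAD-PATHNAME*
                          ;; is still bound by the innermost active LOAD.
                          (format *error-output*
                                  "~&Error while loading ~A:~%  ~A~%"
                                  *load-pathname* condition))))
    (apply #'load path load-args)))
```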

Your assertion that:
> the moment I load a file in UTF-8 all successive files are expected to
> be in the same encoding. 
is wrong.  

LOAD uses the :DEFAULT external-format by default.  In clisp, the
meaning of :DEFAULT depends on CUSTOM:*DEFAULT-FILE-ENCODING*.

C/USER[13]> custom:*default-file-encoding*
#<ENCODING CHARSET:UTF-8 :UNIX>
C/USER[14]> (load "/tmp/a.lisp" :external-format charset:ascii)
;; Loading file /tmp/a.lisp ...
#P"/tmp/a.lisp" 
#<ENCODING CHARSET:UTF-8 :UNIX> 
#<ENCODING CHARSET:ASCII :UNIX> 
;;  Loading file /tmp/b.lisp ...
#P"/tmp/b.lisp" 
#<ENCODING CHARSET:UTF-8 :UNIX> 
#<ENCODING CHARSET:UTF-8 :UNIX> 
;;  Loaded file /tmp/b.lisp
;; Loaded file /tmp/a.lisp
T
C/USER[15]> (cat "/tmp/a.lisp")
(print *load-pathname*)
(print custom:*default-file-encoding*)
(print (stream-external-format SYSTEM::*LOAD-INPUT-STREAM*))
(load "/tmp/b.lisp")
C/USER[16]> (cat "/tmp/b.lisp")
(print *load-pathname*)
(print custom:*default-file-encoding*)
(print (stream-external-format SYSTEM::*LOAD-INPUT-STREAM*))
C/USER[17]> 

When you load a file with an explicit external-format, the files it
loads recursively do not inherit that external-format.  Still less do
files loaded afterwards.

The moment you load a file with CUSTOM:*DEFAULT-FILE-ENCODING* bound
to a given external-format (with ext:letf given that
CUSTOM:*DEFAULT-FILE-ENCODING* is a symbol macro), the other files
loaded recursively will indeed use the same external-format as long as
they are loaded with a LOAD form that uses :DEFAULT as external-format.
But any file loaded thereafter will use the previous binding of
CUSTOM:*DEFAULT-FILE-ENCODING*.
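
A sketch of that temporary binding, CLISP only. The function name is made up; EXT:LETF, EXT:MAKE-ENCODING and CUSTOM:*DEFAULT-FILE-ENCODING* are the CLISP names used elsewhere in this thread:

```lisp
#+clisp
(defun load-with-default-encoding (path charset)
  "LOAD PATH with CUSTOM:*DEFAULT-FILE-ENCODING* temporarily set to CHARSET."
  ;; LETF restores the previous value on exit, so nothing is sticky.
  (ext:letf ((custom:*default-file-encoding*
              (ext:make-encoding :charset charset)))
    (load path)))

;; Hypothetical usage; "main.lisp" is a made-up file name:
;; (load-with-default-encoding "main.lisp" charset:utf-8)
```

Files that main.lisp loads with a plain (load "...") would see the same default; files loaded after the call returns would see the old one.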

The only sticky case would be if you assigned a new encoding to
CUSTOM:*DEFAULT-FILE-ENCODING* (with SETF or SETQ), but then why would
you be surprised that all the files opened thereafter use that
encoding by default?

In short, to get the behavior you're complaining about, all the files
loaded recursively would have to specify the external-format
explicitly, with forms such as:

(load "other-file.lisp"
   :external-format (stream-external-format
                       #+clisp system::*load-input-stream*
                       #-clisp (error "How else?")))

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

THIS IS A 100% MATTER PRODUCT: In the unlikely event that this
merchandise should contact antimatter in any form, a catastrophic
explosion will result.

From: proton
Subject: Re: Unicode file in CLISP
Date: Mon, 28 Jan 2008 15:22:59 +0000
Message-ID: <e5d1e768-8f75-4e1a-af99-9e78612c0f45@e23g2000prf.googlegroups.com>

Pascal, thank you so much for your help. There are still a few things
I am not sure about. For example, what is the difference between
CUSTOM:*DEFAULT-FILE-ENCODING* and simply *DEFAULT-FILE-ENCODING*?
Both seem to work, and I can't tell the difference.

Further, you are right about the encoding of each file being based on
this variable and not on the encoding of the first file. What got me
completely confused in my case was that (I hope I get the explanation
right this time) the first statement of the first file was:

(setf *DEFAULT-FILE-ENCODING* (make-encoding :charset charset:utf-8))

With this, all subsequent UTF-8 files were loaded correctly, but not
the very first one, since the loading had been already launched before
reading this statement. Now, to make things more misleading, the
second time I tried to load the file it would work, because obviously
the statement had taken effect and then the file could be read
correctly. This got me totally confused. I hope this explanation is
correct, because otherwise it would mean that I didn't understand a
thing. :<

Yet another thing I don't understand is why
*DEFAULT-FILE-ENCODING* is defined as a symbol macro, when apparently all it
contains is the encoding value. Wouldn't it have been enough to simply
setq it?
Again, thank you so much for your great help.

From: Pascal J. Bourguignon
Subject: Re: Unicode file in CLISP
Date: Mon, 28 Jan 2008 16:02:27 +0000
Message-ID: <7cir1e7xbg.fsf@pbourguignon.anevia.com>

proton <··········@gmail.com> writes:

> Pascal, thank you so much for your help. There are still a few things
> I am not sure about. For example, what is the difference between
> CUSTOM:*DEFAULT-FILE-ENCODING* and simply *DEFAULT-FILE-ENCODING*?
> Both seem to work, and I can't tell the difference.

Try:

(eq 'CUSTOM:*DEFAULT-FILE-ENCODING* '*DEFAULT-FILE-ENCODING*)

if you get T, then they're the same.

There may be a difference if you are in a package that doesn't import
CUSTOM.  In clisp, by default, COMMON-LISP-USER imports it, so there
would be no difference.  

But as soon as you make your own package, without mentioning CUSTOM,
you don't have direct access to these symbols anymore, so it's better
to always qualify them.
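
For instance, a sketch with a made-up package name. Inside MY-APP the unqualified name would be read as a fresh symbol MY-APP::*DEFAULT-FILE-ENCODING*, which has nothing to do with CLISP's setting, while the qualified name keeps working:

```lisp
(defpackage :my-app            ; hypothetical package name
  (:use :common-lisp))         ; CUSTOM is deliberately not used
(in-package :my-app)

(defun show-default-encoding ()
  ;; Qualified, so it names CLISP's variable even though MY-APP
  ;; does not use the CUSTOM package.
  #+clisp (print custom:*default-file-encoding*)
  #-clisp (warn "CUSTOM:*DEFAULT-FILE-ENCODING* is CLISP-specific."))
```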

> Further, you are right about the encoding of each file being based on
> this variable and not on the encoding of the first file. What got me
> completely confused in my case was that (I hope I get the explanation
> right this time) the first statement of the first file was:
>
> (setf *DEFAULT-FILE-ENCODING* (make-encoding :charset charset:utf-8))

Yes, in this case, the behavior you've observed would be explained.

I don't think it's a good idea to put such a form in a lisp file (for
the very reason you described, that setting being sticky).

> With this, all subsequent UTF-8 files were loaded correctly, but not
> the very first one, since the loading had been already launched before
> reading this statement. Now, to make things more misleading, the
> second time I tried to load the file it would work, because obviously
> the statement had taken effect and then the file could be read
> correctly. This got me totally confused. I hope this explanation is
> correct, because otherwise it would mean that I didn't understand a
> thing. :<

I've got code somewhere that reads the header and trailer of a source
file as bytes, scans it for -*- coding:... -*- or the equivalent
bottom local variable comment, and then closes the file and loads it
with the correct encoding.  Here it is:
http://darcs.informatimago.com/lisp/common-lisp/make-depends.lisp
(starting from "Emacs File Local Variables")  See how it's used by
scan-source-file in the same file.  A nice little library is
struggling to get out of it...
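
To give the idea without following the link, here is a much simplified sketch. It is not the code from make-depends.lisp: it only looks at the first bytes of the file, ignores the trailer, and assumes the implementation's character codes below 128 match ASCII. The function name is made up.

```lisp
(defun file-coding-cookie (path &optional (max-bytes 256))
  "Return the name following \"coding:\" in the first MAX-BYTES of PATH, or NIL."
  (let* ((buffer (make-array max-bytes :element-type '(unsigned-byte 8)))
         (end (with-open-file (in path :element-type '(unsigned-byte 8))
                (read-sequence buffer in)))
         ;; Map octets to characters; anything non-ASCII becomes #\?.
         (text (map 'string
                    (lambda (byte) (if (< byte 128) (code-char byte) #\?))
                    (subseq buffer 0 end)))
         (start (search "coding:" text)))
    (when start
      (let* ((from (position-if-not (lambda (ch) (char= ch #\Space))
                                    text :start (+ start (length "coding:"))))
             ;; The name runs until the first character that is not
             ;; alphanumeric, hyphen, or underscore.
             (to (and from
                      (or (position-if-not
                           (lambda (ch) (or (alphanumericp ch) (find ch "-_")))
                           text :start from)
                          (length text)))))
        (and from (< from to) (subseq text from to))))))
```

The returned string still has to be mapped to an encoding object before it can be handed to LOAD.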

Anyways, the point is that it's easier to deal with sources if they
are all in the same encoding. (Something that system (or world)
packagers should mind perhaps)  Otherwise, you should determine the
encoding explicitly, and tell LOAD what encoding to use, with the
:EXTERNAL-FORMAT parameter.
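
In CLISP that could look like the following sketch (the wrapper name is hypothetical; CHARSET:UTF-8 is the same object used earlier in the thread):

```lisp
#+clisp
(defun load-utf-8 (path)
  "LOAD PATH as UTF-8 regardless of CUSTOM:*DEFAULT-FILE-ENCODING*."
  ;; Only this file is affected; files it loads in turn fall back
  ;; to the default encoding unless they pass :EXTERNAL-FORMAT too.
  (load path :external-format charset:utf-8))
```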

> There is yet another thing I don't understand is why *DEFAULT-FILE-
> ENCODING* is defined as a macro symbol, when apparently all it
> contains is the encoding value. Wouldn't it have been enough to simply
> setq it?

Implementation details.  Since clisp is written in C, I guess they
used a push model here rather than pull, meaning that when we change
the setting, they convert and store the encoding lisp object into C
data to be used by the rest of the code later.  Hence the low level
API here is defined in terms of setter and getter functions, instead
of a global variable.

C/CL-USER[150]> (macroexpand-1 '(setf custom:*default-file-encoding* charset:utf-8))
(SYSTEM::SET-DEFAULT-FILE-ENCODING CHARSET:UTF-8) ;
T
C/CL-USER[151]> (macroexpand-1 'custom:*default-file-encoding*)
(SYSTEM::DEFAULT-FILE-ENCODING) ;
T

The alternative would be to modify the C code to access and use the
lisp object every time it needs it, perhaps with a little cache.
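
The same pattern can be imitated in portable Common Lisp. This is only an analogy, not clisp's actual code: all the names are invented, and a special variable stands in for the state clisp keeps on the C side.

```lisp
(defvar *stored-encoding* :cp1252
  "Hypothetical stand-in for state kept on the C side.")

(defun default-encoding ()
  "Getter: fetch the stored setting."
  *stored-encoding*)

(defun (setf default-encoding) (new-value)
  ;; A real implementation could convert NEW-VALUE to C data here (the push).
  (setf *stored-encoding* new-value))

;; The "variable" is really a symbol macro over the getter.
(define-symbol-macro *my-default-encoding* (default-encoding))

;; SETF on the symbol macro expands into a call to the setter.
(setf *my-default-encoding* :utf-8)
```

It also shows why a plain LET is not enough for a temporary change: LET on a symbol-macro name just creates a lexical variable that shadows the macro, without ever calling the setter.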

> Again, thank you so much for your great help.

-- 
__Pascal Bourguignon__