From: ············@ma.ultranet.com
Subject: Unicode, UTF-16, etc..
Date: 
Message-ID: <35b65115.28058395@news.ma.ultranet.com>
What is the recommended technique for dynamic management of character encodings
in Common Lisp?

For instance, looking at the first bytes of an XML file to determine encoding
type can be done with unsigned byte streams.  If I determine that the encoding
is UTF-16, what is the recommended way of dealing with parsing and
interpretation of UTF-16 content in lisp?
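
The byte-sniffing step mentioned above can be done portably.  A minimal
sketch (the keyword names returned are just illustrative):

```lisp
;; Open the file as unsigned bytes and inspect the first two octets
;; for a UTF-16 byte-order mark.  Plain ASCII/UTF-8 XML would instead
;; begin with #x3C (the character "<").
(defun sniff-encoding (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((b0 (read-byte in nil nil))
          (b1 (read-byte in nil nil)))
      (cond ((and (eql b0 #xFE) (eql b1 #xFF)) :utf-16-big-endian)
            ((and (eql b0 #xFF) (eql b1 #xFE)) :utf-16-little-endian)
            (t :unknown)))))
```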

(stream, character, equality issues, though if I knew what I was talking about,
I wouldn't be asking).


Dave Tenny
············@ma.ultranet.com - no spam please

From: Howard R. Stearns
Subject: Re: Unicode, UTF-16, etc..
Date: 
Message-ID: <35BCD0ED.F138C3E3@elwood.com>
Here's my understanding:

When you open a file (explicitly with OPEN, or implicitly with LOAD
or COMPILE-FILE), you can specify the :EXTERNAL-FORMAT.

This defaults to :DEFAULT, which means that your Lisp implementation is
free to determine the correct format: ASCII, UTF-16, UTF-8, etc.  If it
is able to do this, you're done.  Otherwise, your implementation may
allow you to explicitly specify an external-format (:ASCII, :UTF,
:MULTI-BYTE), etc.

In order to clarify things for your implementation, you may need to also
specify :ELEMENT-TYPE.  This defaults to CHARACTER.  Some combinations
might not make sense.  For example, in a Lisp where BASE-CHAR and
EXTENDED-CHAR are distinct, it might not make sense to specify
:ELEMENT-TYPE 'BASE-CHAR :EXTERNAL-FORMAT :UTF.  Such incongruity might
signal an error when the file is opened, or might not signal an error
until you tried to read a character that ended up actually being bigger
than a base-char.
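
To make this concrete, here is what such a call might look like.  The
:EXTERNAL-FORMAT values accepted (beyond :DEFAULT) are entirely
implementation-defined, and "data.xml" is just a hypothetical file:

```lisp
;; Portable in shape, implementation-defined in detail: :DEFAULT lets
;; the Lisp pick the format; a keyword like :UTF would only work in an
;; implementation that defines it.
(with-open-file (in "data.xml"
                    :element-type 'character
                    :external-format :default)
  (read-line in))
```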

Alas, my employer's Eclipse is the only CL I know that even tries to
support non-base extended-chars and the above formats.  You can read how
we handle these issues in our on-line doc: 
  http://www.elwood.com/eclipse/char.htm
  http://www.elwood.com/eclipse/open.htm
  http://www.elwood.com/eclipse/compile.htm (last two paragraphs)
(Eclipse isn't smart enough to determine the :EXTERNAL-FORMAT by
inspection, but is smart enough to use :ASCII when :ELEMENT-TYPE
BASE-CHAR is specified, and :MULTI-BYTE (UTF-8) when :ELEMENT-TYPE
CHARACTER is specified.  UTF-8 is identical to :ASCII when only ASCII
characters are used, but also handles non-ascii Unicode using multiple
byte encodings as needed.)

OK, now suppose you are using some other implementation that only
handles one format, which, whatever it may be called, is equivalent to
8-bit ASCII.  If, despite this, your implementation handles Unicode
characters internally, you could read the characters into a sequence
(using READ-SEQUENCE, READ-STRING, etc.) and then convert them using
CHAR-CODE, etc.  This is unlikely.
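
A sketch of that conversion, assuming (as the paragraph warns, this is
unlikely to hold everywhere) that the implementation's internal
character codes are Unicode scalar values:

```lisp
;; Read characters into a string with READ-SEQUENCE, then recover the
;; code points with CHAR-CODE.  Only the first 1024 characters are
;; read in this sketch.
(defun file-codes (pathname)
  (with-open-file (in pathname :element-type 'character)
    (let* ((buffer (make-string 1024))
           (n (read-sequence buffer in)))
      (map 'list #'char-code (subseq buffer 0 n)))))
```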

Finally, if all else fails, you can read BASE-CHAR sequences as above,
and then copy them to a specialized array of, say (unsigned-byte 16),
using CHAR-CODE, etc.  You'll have to check what your
implementation's CHAR-CODE/CHAR-INT does with the 8th bit -- you don't
want to lose it.  If neither CHAR-CODE nor CHAR-INT preserves the 8th
bit, you should open the stream as :ELEMENT-TYPE (UNSIGNED-BYTE 16) and
then use the "binary" reader functions (READ-SEQUENCE, READ-BYTE). (This
may be a better idea anyway.)  
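
The binary-stream variant might look like this.  Whether a stream
element type of (UNSIGNED-BYTE 16) is supported, and which byte order
it implies, are both implementation-dependent:

```lisp
;; Bypass character streams entirely and read 16-bit code units with
;; the "binary" reader functions.  The result is a specialized array
;; suitable for handing to foreign code, not a Lisp string.
(defun read-utf16-units (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 16))
    (let* ((units (make-array 1024 :element-type '(unsigned-byte 16)))
           (n (read-sequence units in)))
      (subseq units 0 n))))
```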

Although the result can't participate as a normal Lisp string in the CL
string and character operations, it may be good enough for passing to
operating system utilities through a foreign function interface.  For
example, instead of calling Lisp ALPHA-CHAR-P, you might be able to use
the C iswalpha().

There are (at least) three things to keep in mind with this:

I. There may be big-endian/little-endian issues.  This applies both when
reading in the data and when calling OS or other library utilities. It
is conceivable that:
  1. The UTF format, your operating system/library utilities, and your
Lisp implementation all use the same endianness, or
  2. Your Lisp provides some handle to correct for this in both the
binary :EXTERNAL-FORMAT and in the foreign function interface.
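
If neither of those holds, each 16-bit code unit can be byte-swapped
by hand after a binary read -- a simple sketch:

```lisp
;; Swap the two octets of a 16-bit code unit, e.g. to convert a
;; big-endian UTF-16 unit read on a little-endian host.
(defun swap-bytes-16 (unit)
  (logior (ash (logand unit #xFF) 8)
          (ash unit -8)))

;; (swap-bytes-16 #xFEFF) => #xFFFE
```

Checking the byte-order mark (#xFEFF read correctly vs. #xFFFE read
swapped) tells you whether the swap is needed at all.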

II. On some platforms, the operating system might handle wide characters
as 2 bytes, while on others it might be 4 bytes: an operating system or
other library utility that manipulates UTF-16 wants the arguments to be
arrays of wchar_t, which might be 2 bytes on some platforms and 4 bytes
on others.

III. Pathname namestrings (or strings in other operating system
interfaces such as getenv()) raise the same issues.  For example, when dealing with the
operating system, Eclipse truncates bits higher than the low 8 of each
character of a pathname namestring.  For many Unix systems, this may be
the right thing.  For others, it would be nice if we converted to
multi-byte or used the operating system wide character version of
open(), etc.

············@ma.ultranet.com wrote:
> 
> What is the recommended technique for dynamic management of character encodings in
> common lisp?
> 
> For instance, looking at the first bytes of an XML file to determine encoding
> type can be done with unsigned byte streams.  If I determine that the encoding
> is UTF-16, what is the recommended way of dealing with parsing and
> interpretation of UTF-16 content in lisp?
> 
> (stream, character, equality issues, though if I knew what I was talking about,
> I wouldn't be asking).
> 
> Dave Tenny
> ············@ma.ultranet.com - no spam please
From: David Fox
Subject: Re: Unicode, UTF-16, etc..
Date: 
Message-ID: <epyatbmvto.fsf@harlequin.co.uk>
"Howard R. Stearns" <······@elwood.com> writes:


> Alas, my employer's Eclipse is the only CL I know that even tries to
> support non-base extended-chars and the above formats.  You can read how
> we handle these issues in our on-line doc: 
>   http://www.elwood.com/eclipse/char.htm
>   http://www.elwood.com/eclipse/open.htm
>   http://www.elwood.com/eclipse/compile.htm (last two paragraphs)
> (Eclipse isn't smart enough to determine the :EXTERNAL-FORMAT by
> inspection, but is smart enough to use :ASCII when :ELEMENT-TYPE
> BASE-CHAR is specified, and :MULTI-BYTE (UTF-8) when :ELEMENT-TYPE
> CHARACTER is specified.  UTF-8 is identical to :ASCII when only ASCII
> characters are used, but also handles non-ascii Unicode using multiple
> byte encodings as needed.)
> 


LispWorks 4 also supports non-base EXTENDED-CHARs, and multibyte and
widechar file streams. 

It uses an extensible algorithm for determining the encoding in a
file, and this is configurable.  For instance, by default it assumes
that files are Latin-1 or Unicode, but a simple configuration enables,
e.g., defaulting to MS code pages or detecting common Japanese
encodings instead.  The user should be able to extend the algorithm so
that OPEN detects encodings from an XML header.

Currently we support ASCII, LATIN-1, EUC-JP, Shift-JIS, UNICODE, and
Windows code page encodings. From this thread, it looks like we should
provide UTF-8 and UTF-16 as well.

The default algorithm also handles both LF and CRLF line termination.

OPEN checks for consistency between the external-format and
element-type. Because LispWorks external-formats have a well-defined
notion of the associated lisp element type, the most flexible idiom
is:

(OPEN <file> :ELEMENT-TYPE :DEFAULT ...)

yielding a stream with STREAM-ELEMENT-TYPE determined via the
external-format. The :EXTERNAL-FORMAT argument should be supplied if
known, but otherwise it's determined via a function
SYSTEM:GUESS-EXTERNAL-FORMAT.

You can see our on-line doc at:
http://www.harlequin.com/education/books/lww_doc.html

-- 
Dave Fox                                  Email: ·····@harlequin.com
Harlequin Ltd, Barrington Hall,           Tel:   +44 1223 873879
Barrington, Cambridge CB2 5RG, England.   Fax:   +44 1223 873873
These opinions are not necessarily those of Harlequin.