From: Erik Naggum
Subject: the character type
Date: 
Message-ID: <3036079150517560@arcana.naggum.no>
suppose you have a file in an unknown character set.  the file starts with
some codes that tell you how it is encoded.  each character could be 7, 8,
14, 16 or 21 bits wide, encoded as 1, 1, 2, 2, and 4 bytes, respectively.
additionally it could be using any of the ISO 2022 or ISO 10646 methods of
encoding.

suppose you want to read this into the system as _characters_, and work on
them as characters, for purposes of mapping, transformation, processing, or
interpreting where the encoding is irrelevant to the result, but you need
to generate some _encoding_ on output, which need not be a function of the
input encodings used.

I can fake it with integers, but I really want to distinguish characters
from integers.  I actually want characters to be known by longish, unique
names, and have several possible encodings.  I can fake this with symbols,
but symbols are not characters, and I think Common Lisp should have a
strong enough character type that it could do all this, not least
because of the requirements of internationalization.

I would like to be able to have a number of character set descriptions
(tables) and encoding algorithms (filter functions) that allow me to return
the character read from an input source.  let me take a simple example.
suppose the character set is ISO 8859-1, but it has been reduced to 7-bit
encoding according to ISO 2022, such that SO and SI are used to switch
between the "low" and the "high" half.  the usual way to deal with this is
to let SO and SI toggle the 8th bit, but I don't want that.  I want SO and
SI to change the mapping for the current 7-bit character set such that the
right character is returned directly.  I want this because SO and SI are
not the only possible character set shift control characters in ISO 2022.
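
A minimal sketch of that idea, under stated assumptions: SO and SI replace the
current 7-bit mapping table instead of toggling the 8th bit, so each code is
looked up directly as a character.  The names `make-table', `*g0*', `*g1*'
and `decode-7bit' are hypothetical, and the sketch assumes an implementation
whose `char-code' values agree with ISO 8859-1 for codes 0-255 (true of
Unicode-based implementations, but not guaranteed by the standard).

```lisp
;; Hypothetical sketch: ISO 8859-1 carried in 7 bits, with SO/SI selecting
;; the mapping table.  Assumes CHAR-CODE agrees with ISO 8859-1 for 0-255.
(defconstant +so+ 14)                   ; SHIFT OUT: invoke G1 (the "high" half)
(defconstant +si+ 15)                   ; SHIFT IN:  invoke G0 (the "low" half)

(defun make-table (high-offset)
  "Return a 128-entry vector mapping a 7-bit code to a character.
Control codes 0-31 map to themselves; codes 32-127 are moved by HIGH-OFFSET."
  (let ((table (make-array 128)))
    (dotimes (code 128 table)
      (setf (aref table code)
            (code-char (if (< code 32) code (+ code high-offset)))))))

(defparameter *g0* (make-table 0))      ; left half:  codes map to 0-127
(defparameter *g1* (make-table 128))    ; right half: 32-127 map to 160-255

(defun decode-7bit (codes)
  "Decode a list of 7-bit integer codes into a string.
SO and SI are consumed and only change which table is current."
  (let ((current *g0*)
        (out (make-string-output-stream)))
    (dolist (code codes (get-output-stream-string out))
      (cond ((= code +so+) (setf current *g1*))
            ((= code +si+) (setf current *g0*))
            (t (write-char (aref current code) out))))))

;; Example: c a f SO i SI, where code 105 under G1 is LATIN SMALL LETTER E
;; WITH ACUTE (105 + 128 = 233).
;; (decode-7bit '(99 97 102 14 105 15))
```

Because the shift state is just "which table is current", other ISO 2022
shift functions could in principle be handled by further `cond' clauses that
select other tables; that extension is not shown here.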

additionally, I want to be able to parse escape sequences as their
appropriate pseudo-characters, but this is above and beyond the others.

ideally, I would like to have a facility that allowed me to create new
characters, name them, and assign _multiple_ codes to them.  the key to my
quest is that it should be possible to read and write them using various
encoding schemes and to put them into strings and still to have the rest of
the Common Lisp system deal with them.
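
What such a facility might look like can be sketched in portable Common Lisp,
with the caveat that the objects below are structures, not true characters,
so they cannot be stored in ordinary strings.  Everything here
(`abstract-char', `define-char', `decode-char', `encode-char', and the
encoding keywords) is a hypothetical illustration, not an existing API.

```lisp
;; Hypothetical sketch: characters known by long names, each with
;; several codes, one per encoding.
(defstruct (abstract-char (:constructor %make-abstract-char (name)))
  name
  (codes '()))                          ; alist of (encoding . code)

(defvar *chars-by-name* (make-hash-table :test #'equal))
(defvar *chars-by-code* (make-hash-table :test #'equal))

(defun define-char (name &rest encoding-code-pairs)
  "Create the character NAME, or extend it if it already exists.
ENCODING-CODE-PAIRS is a plist such as :iso-8859-1 #xE9 :cp437 #x82."
  (let ((char (or (gethash name *chars-by-name*)
                  (setf (gethash name *chars-by-name*)
                        (%make-abstract-char name)))))
    (loop for (encoding code) on encoding-code-pairs by #'cddr
          do (push (cons encoding code) (abstract-char-codes char))
             (setf (gethash (list encoding code) *chars-by-code*) char))
    char))

(defun decode-char (encoding code)
  "Return the character that CODE denotes in ENCODING, or NIL."
  (values (gethash (list encoding code) *chars-by-code*)))

(defun encode-char (char encoding)
  "Return the code of CHAR in ENCODING, or NIL if it has none."
  (cdr (assoc encoding (abstract-char-codes char))))

;; One character, two codes: #xE9 in ISO 8859-1 and #x82 in code page 437.
(defparameter *e-acute*
  (define-char "LATIN SMALL LETTER E WITH ACUTE"
    :iso-8859-1 #xE9
    :cp437 #x82))
```

With this, reading a code in one encoding and writing the same character in
another is `(encode-char (decode-char :cp437 #x82) :iso-8859-1)'.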

I am disappointed with the character type in Common Lisp -- it seems
incongruously inflexible -- and wonder if any work has been done in this
area.  e.g., are there Common Lisp implementations that allow GB5, UTF, JIS
X 0208, etc, or so-called "wide characters"?

I think I would like to see `(setf char-name)', `(setf char-code)', and/or
a (new) `make-char' function that allowed me to build new characters.

the font and bits attributes were a start in the right direction, although
less general than they should have been, but they were removed from the
ANSI standard.

any ideas?

#<Erik>
-- 
the Internet made me do it
From: David B. Lamkins
Subject: Re: the character type
Date: 
Message-ID: <dlamkins-1803960858550001@ip-pdx09-10.teleport.com>
In article <················@arcana.naggum.no>, Erik Naggum
<····@naggum.no> wrote:

> suppose you have a file in an unknown character set.  the file starts with
[...]

> I am disappointed with the character type in Common Lisp -- it seems
> incongruously inflexible -- and wonder if any work has been done in this
> area.  e.g., are there Common Lisp implementations that allow GB5, UTF, JIS
> X 0208, etc, or so-called "wide characters"?
> 
> I think I would like to see `(setf char-name)', `(setf char-code)', and/or
> a (new) `make-char' function that allowed me to build new characters.
> 
> the font and bits attributes were a start in the right direction, although
> less general than they should have been, but they were removed from the
> ANSI standard.

I believe that MCL has support for wide characters.

Since you can define your own types, you could certainly add this
functionality to an existing Lisp.  It might be a lot of work to override
the behavior of all existing string functions, so why not just build the
functionality which is specific to your application and not worry about
its absence in the ANSI standard?
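
As a rough sketch of the application-specific route, and assuming nothing
beyond standard Common Lisp: if the characters are interned structures and
the "strings" are plain vectors of them, the generic sequence functions
already work, and only the string-specific functions would need
replacements.  The names `xchar', `intern-xchar' and `xstring' are
hypothetical.

```lisp
;; Hypothetical sketch: an application-defined character type whose
;; "strings" are ordinary vectors, so sequence functions apply unchanged.
(defstruct (xchar (:constructor %make-xchar (name)))
  name)

(defvar *xchars* (make-hash-table :test #'equal))

(defun intern-xchar (name)
  "Return the unique XCHAR called NAME, creating it on first use.
Interning makes equal names yield the same object, so EQL tests suffice."
  (or (gethash name *xchars*)
      (setf (gethash name *xchars*) (%make-xchar name))))

(defun xstring (&rest names)
  "Return a vector of the interned characters named by NAMES."
  (map 'vector #'intern-xchar names))

(defparameter *text*
  (xstring "LATIN SMALL LETTER A" "NO-BREAK SPACE" "LATIN SMALL LETTER B"))

;; POSITION, SEARCH and SUBSEQ need no overriding:
;; (position (intern-xchar "NO-BREAK SPACE") *text*)
;; (search (xstring "NO-BREAK SPACE" "LATIN SMALL LETTER B") *text*)
;; (subseq *text* 1)
```

The limit of this approach is that `*text*' is not a `string', so functions
such as `string=' or `string-upcase' will not accept it.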

-- 
Dave
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU Cycles: Use them now or lose them forever...
http://www.teleport.com/~dlamkins/