From: Peter Seibel
Subject: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <m3hdr0wfiu.fsf@javamonkey.com>
CLISP, as far as I know, uses Unicode code points as its internal
character codes and has a char-code-limit of #x110000, which accounts
for all the Unicode code points. Allegro 6.2 has a char-code-limit of
#x10000, i.e. 16-bit characters. SBCL, CMUCL, and OpenMCL have a
char-code-limit of 256. Dunno about LispWorks, GCL, MCL, or
Armed Bear Common Lisp. Are there Common Lisps other than CLISP whose
native character codes include all the Unicode code points? Or are any
of the non-Unicode Lisps likely to support Unicode in the near future?
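
For anyone who wants to check their own implementation, here is a
quick portable snippet (standard CL only):

   ;; #x110000 or more means room for every Unicode code point.
   (format t "char-code-limit = #x~X (~:[not all~;all~] of Unicode)~%"
           char-code-limit (>= char-code-limit #x110000))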

-Peter

-- 
Peter Seibel                                      ·····@javamonkey.com

         Lisp is the red pill. -- John Fraser, comp.lang.lisp

From: Edi Weitz
Subject: Re: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <87d61o4bj7.fsf@bird.agharta.de>
On Wed, 18 Aug 2004 17:00:36 GMT, Peter Seibel <·····@javamonkey.com> wrote:

> CLISP, as far as I know, uses Unicode code points as its internal
> character codes and has a char-code-limit of #x110000, which
> accounts for all the Unicode code points. Allegro 6.2 has a
> char-code-limit of #x10000, i.e. 16-bit characters. SBCL, CMUCL, and
> OpenMCL have a char-code-limit of 256. Dunno about Lispworks

LW also has #x10000.

Edi.

-- 

"Lisp doesn't look any deader than usual to me."
(David Thornley, reply to a question older than most languages)

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Brian Downing
Subject: Re: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <p2lZc.105348$mD.33342@attbi_s02>
In article <··············@javamonkey.com>,
Peter Seibel  <·····@javamonkey.com> wrote:
> CLISP, as far as I know, uses Unicode code points as its internal
> character codes and has a char-code-limit of #x110000, which accounts
> for all the Unicode code points. Allegro 6.2 has a char-code-limit of
> #x10000, i.e. 16-bit characters. SBCL, CMUCL, and OpenMCL have a
> char-code-limit of 256. Dunno about Lispworks, GCL, MCL, or
> ArmedBearCommonLisp. Are there Common Lisps other than CLISP whose
> native character codes include all the Unicode code points? Or are any
> of the non-Unicode Lisps likely to support Unicode in the near future?

The rumors on sbcl-devel and #lisp indicate that SBCL will be getting
full (21-bit) Unicode support in the near future.  (FSVO "near").

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Raymond Toy
Subject: Re: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <sxd1xhmx93s.fsf@edgedsp4.rtp.ericsson.se>
>>>>> "Brian" == Brian Downing <·············@lavos.net> writes:

    Brian> In article <··············@javamonkey.com>,
    Brian> Peter Seibel  <·····@javamonkey.com> wrote:
    >> CLISP, as far as I know, uses Unicode code points as its internal
    >> character codes and has a char-code-limit of #x110000, which accounts
    >> for all the Unicode code points. Allegro 6.2 has a char-code-limit of
    >> #x10000, i.e. 16-bit characters. SBCL, CMUCL, and OpenMCL have a
    >> char-code-limit of 256. Dunno about Lispworks, GCL, MCL, or
    >> ArmedBearCommonLisp. Are there Common Lisps other than CLISP whose
    >> native character codes include all the Unicode code points? Or are any
    >> of the non-Unicode Lisps likely to support Unicode in the near future?

    Brian> The rumors on sbcl-devel and #lisp indicate that SBCL will be getting
    Brian> full (21-bit) Unicode support in the near future.  (FSVO "near").

CMUCL has a Unicode branch from some time ago.  I believe it's fairly
complete, but it has not been kept up to date (for lack of developer
time and/or knowledge), and it only worked on x86.

Ray
From: Ray Dillinger
Subject: Re: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <STJZc.11632$54.161041@typhoon.sonic.net>
Peter Seibel wrote:
> CLISP, as far as I know, uses Unicode code points as its internal
> character codes and has a char-code-limit of #x110000, which accounts
> for all the Unicode code points. Allegro 6.2 has a char-code-limit of
> #x10000, i.e. 16-bit characters. SBCL, CMUCL, and OpenMCL have a
> char-code-limit of 256. Dunno about Lispworks, GCL, MCL, or
> ArmedBearCommonLisp. Are there Common Lisps other than CLISP whose
> native character codes include all the Unicode code points? Or are any
> of the non-Unicode Lisps likely to support Unicode in the near future?
> 

I think the "right thing" here is actually beyond 21-bit unicode.
Unicode codepoints, in many cases, are not characters. I think that
the "right thing" with unicode is to allow characters that are a
unicode base codepoint followed by any nondefective sequence of
combining codepoints.  If we can do that, then there's only about
a dozen ligatures and sharp-s that change the string length on a
case change.  And ligatures aren't "canonical" anyway; they're
supposed to be considerations at the rendering level, not the
character level. So sharp-s is really the only canonical character
that causes a problem if we adopt a broader view of characters.

Some implications:  char-code-limit would effectively be infinite,
and char-int might return a bignum.
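
To make the bignum idea concrete, here is one hypothetical packing
(names and layout invented for illustration): store a character's
codepoints in a single integer, 21 bits apiece, with the base
codepoint in the low bits.

   (defun grapheme-int (codepoints)
     ;; One possible char-int for a base-plus-combiners character.
     (loop for cp in codepoints
           for shift from 0 by 21
           sum (ash cp shift)))

   ;; #\s followed by combining dot below (U+0323):
   (grapheme-int '(#x73 #x323)) ; a fixnum on 64-bit hosts, but
                                ; longer clusters yield bignums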

Some questions implementors need to consider regardless of
whether they go with the "Codepoint=character" approach or the
"Glyph=character" approach:

     Codepoints within the Unicode codespace but not assigned to a
character by the Unicode standard:  Should there be corresponding
objects in Lisp of type character?

     Codepoints within the Unicode codespace but reserved by the
standard (with the promise that there shall *NEVER* be such a
character):  Should there be corresponding objects in Lisp of type
character?

    NFC is a vastly larger character repertoire than (and a proper
superset of) NFKC.  How does a Lisp work properly with both?

				Bear
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Lisps other than CLISP that support the full Unicode character repertoire?
Date: 
Message-ID: <874qmgcp3u.fsf@qrnik.zagroda>
Ray Dillinger <····@sonic.net> writes:

> I think the "right thing" here is actually beyond 21-bit unicode.
> Unicode codepoints, in many cases, are not characters. I think that
> the "right thing" with unicode is to allow characters that are a
> unicode base codepoint followed by any nondefective sequence of
> combining codepoints.

I've seen lots of people saying this.

All of them said something like "I think it would be the right thing
to do". None of them detailed how to manipulate the parts of such
characters, or how code could talk about a combining character in
isolation.

But all the designs I've seen actually implemented use code points,
UTF-16 units, or UTF-8 units as the elements of the string
representation. These seem much simpler (in decreasing order of
simplicity), and easier for interoperability.

I think the right thing is to express more interfaces in terms of
strings rather than characters. Then characters can be code points
without too many problems.

There is no universal character boundary. Some algorithms need e.g.
grapheme cluster boundaries, which are defined differently. You haven't
considered what to do with ZWJ and ZWNJ (the zero-width joiner and
non-joiner).

Code points are the lowest common denominator. Algorithms like case
mapping and collation are defined in terms of code points. I guess
most Lisps aren't capable of representing code point strings natively
yet. It's hard enough for them to accept that 256 characters are not
enough for everyone.

> If we can do that, then there's only about a dozen ligatures and
> sharp-s that change the string length on a case change.

Don't forget that some case mappings are contextual (e.g. sigma), so
even ignoring ß, string-downcase in Unicode is *not* equivalent to
mapping each character through char-downcase, or Greeks will be upset.
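
A concrete illustration, assuming an implementation whose characters
cover Greek:

   ;; Per-character downcasing cannot produce the final sigma:
   (map 'string #'char-downcase "ΟΔΟΣ") ; => "οδοσ" (medial σ at the end)
   ;; Unicode full lowercasing gives "οδος" (final ς).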

>     NFC is a vastly larger character repertoire than (and a proper
> superset of) NFKC.  How does a Lisp work properly with both?

Hint: it's much easier to apply a transformation when needed than to
undo a transformation which has been forced automatically.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Ray Dillinger
Subject: Unicode LISP??
Date: 
Message-ID: <Vum_c.12029$54.167810@typhoon.sonic.net>
Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <····@sonic.net> writes:
> 
> 
>>I think the "right thing" here is actually beyond 21-bit unicode.
>>Unicode codepoints, in many cases, are not characters. I think that
>>the "right thing" with unicode is to allow characters that are a
>>unicode base codepoint followed by any nondefective sequence of
>>combining codepoints.
> 
> 
> I've seen lots of people saying this.
> 
> All of them said something like "I think it would be the right thing
> to do". They didn't yet detail how to manipulate parts of characters
> themselves, how code can talk about a combining character in isolation.
> 
> But all designs I've seen actually implemented use either code points,
> or UTF-16 units, or UTF-8 units as elements of string representation.
> Seems much simpler (in decreasing order of simplicity), and easier for
> interoperability.
> 
> I think the right thing is to express more interfaces in terms of
> strings rather than characters. Then characters can be code points
> without too many problems.

Okay...  If you were designing a LISP, from the ground up, to be
a fully Unicode-aware language that "Does Unicode Right", what would
you do?

Things I've considered, in various combinations (I know,
some of these are mutually exclusive choices):

1) Combining codepoints in isolation are members of the
    character datatype, but, like control characters and
    characters with buckybits in CLTL2, they aren't
    string-characters; you can't put them into strings as
    independent characters. There are calls that add or
    remove combining codepoints to or from any character,
    making a string-char if the character they're attached
    to is a string-char. There are also compose-char and
    decompose-char functions that build a character from a
    codepoint list and return the codepoint list of any
    character, respectively.

2) Strings have multiple simultaneous indices; they
    are normally indexed in terms of grapheme clusters,
    but with an optional argument you can specify that
    you want to use codepoint indexing instead.  There
    is a function that returns the grapheme-cluster
    index in which a particular codepoint index is found,
    and a function that returns the codepoint index at
    which a particular grapheme-cluster index begins.
    (A sketch of such an API appears after this list.)

3) You could abandon any pretense of enforcing
    "legitimate" Unicode structure on your strings.
    Let any character be a sequence of one or more
    codepoints of any kind, let strings be vectors
    of characters, and leave it to the programmer
    to say exactly what he means and keep the string
    and grapheme structure straight.  This implementation
    would come with an extensive collection of procedures
    to check for and identify problems with character
    and string well-formedness WRT Unicode, but
    absolutely every manipulation would be done by
    explicit character and codepoint manipulation, and
    you only get an error if you try to _output_
    something that is ill-formed.  This gives the
    users the tools they need to build higher-level
    string and char libraries that behave in
    sensible ways without locking them in.

4) Because case-folding is a real bugger in Unicode,
    it might be more practical and in more graceful accord
    with the principle of least surprise to make such a
    LISP case-sensitive.  Case-sensitivity in character
    names also allows the established HTML4 names for
    the common set of international characters. (For
    example, #\Agrave and #\agrave could be different
    characters).


5) Strings stop being a subtype of "array."  A new
    type, "text", includes strings, graphemes, and
    codepoints.
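
A rough sketch of the index translation behind option 2 (all names
invented here, with a crude stand-in for the real grapheme-break
rules):

   (defun combining-p (cp)
     ;; Stand-in: only the main combining-diacritics block.
     (<= #x0300 cp #x036F))

   (defun cluster-starts (codepoints)
     "Codepoint indices at which grapheme clusters begin."
     (loop for cp across codepoints
           for i from 0
           unless (and (plusp i) (combining-p cp))
             collect i))

   ;; "s" + combining dot below + "a" => two clusters:
   (cluster-starts (vector #x73 #x323 #x61)) ; => (0 2)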
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Unicode LISP??
Date: 
Message-ID: <87fz5xlsgw.fsf@qrnik.zagroda>
Ray Dillinger <····@sonic.net> writes:

> Okay...  If you were designing a LISP, from the ground up, to be
> a fully unicode-aware language that "Does Unicode Right" what would
> you do?

I'm not experienced with the Common Lisp library, so it's hard to
tell where it's incompatible with Unicode.


One thing that I noted previously: case mapping should be defined in
terms of strings rather than characters.

Unfortunately this causes problems for the deeply hardwired case
insensitivity, because ignoring case is no longer such a simple
thing. For example, it would no longer be true that
   (string= (string-downcase s) (string-downcase (string-upcase s)))
which fails for strings containing "ß", final small sigma, dotless i,
apostrophe-n, long s, Greek iota subscript, ligatures like "fi",
or other unusual characters.

Unicode defines four text mappings:
- upcasing
- downcasing
- titlecasing
- case folding
where case folding is the one that matters for case-insensitive
comparison: if two strings can be brought to the same string by the
other case operations, they case fold to the same string. Folding is
often the same as lowercasing, but it differs for various special
cases like those above.

Neither lowercasing alone nor uppercasing alone is sufficient to fold
all case differences. Lowercasing alone fails for characters like
those mentioned above. Uppercasing alone fails for capital I with dot
above, the Greek capital theta symbol, and some compatibility variants
of capital letters which don't have unique lowercase equivalents.
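
A toy version of folding, implementing only the ß rule from the real
folding table (and assuming the implementation's characters include
ß):

   (defun toy-casefold (s)
     ;; Real case folding consults the full Unicode table
     ;; (CaseFolding.txt); this handles just one special case.
     (with-output-to-string (out)
       (loop for ch across s
             do (if (char= ch #\ß)
                    (write-string "ss" out)
                    (write-char (char-downcase ch) out)))))

   (string= (toy-casefold "Straße") (toy-casefold "STRASSE")) ; => T
   ;; whereas (string-downcase "Straße") keeps the ß and differs
   ;; from (string-downcase "STRASSE").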

For me, case sensitivity in a programming language would be a good
choice, but Lisp tradition is to be case insensitive.


String representation is not obvious. Let's assume for now that from
the programmer's point of view strings consist of code points.

If they are represented in UTF-8 or UTF-16, string indexing is not
O(1).

If they are represented in UTF-32, ASCII strings take 4 times more
space than byte-packed ASCII would take.

If the representation is chosen per string, e.g. UTF-32 or ISO-8859-1
depending on whether the string contains a character above U+00FF,
then strings may need to have their representation upgraded when they
are updated in place.

Some languages don't have this problem by making strings immutable and
using some other type for mutable strings (e.g. Python, Java, C#).
It's fine for me, but again Lisp tradition is to have mutable strings.

Anyway, if strings need to be upgraded, there are two ways. Either
a string physically contains a pointer to its characters instead of
the characters themselves, or the implementation needs some garbage
collector tricks to be able to extend an object in place, perhaps by
physically moving it elsewhere and updating pointers to it. The latter
is what CLISP does AFAIK, and it uses three string representations
depending on which characters are present: 8-, 16-, or 32-bit.
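
The non-upgrading half of that scheme is easy to sketch in standard
CL (a real implementation would of course hide this beneath the
string type):

   (defun make-compact-string (codepoints)
     ;; Pick the narrowest element type that holds every code point.
     (let* ((max (reduce #'max codepoints :initial-value 0))
            (bits (cond ((< max #x100) 8)
                        ((< max #x10000) 16)
                        (t 32))))
       (make-array (length codepoints)
                   :element-type `(unsigned-byte ,bits)
                   :initial-contents codepoints)))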


As I said, I don't believe that a "more abstract" representation than
a string of code points is feasible.

> 1) Combining codepoints in isolation are members of the
>     character datatype, but, like control characters and
>     characters with buckybits in CLTL2, they aren't
>     string-characters; you can't put them into strings as
>     independent characters.

If strings are not isomorphic to sequences of characters (whatever
exactly "characters" means), I predict confusion and breakage. In just
about any language which has characters as a distinct type from
strings, strings are sequences of characters.

Programs usually work on strings consisting of "well-behaved",
"regular" characters, so bugs in this area would often be left
undetected until someone feeds the program a text containing
rare characters in an unusual combination.

For example, assume that an HTML file contains
   s&#803;
and the program resolves numeric character references to actual
characters, "combining dot below" in this case. A straightforward
implementation would try to put it into a string as a character in
its own right.
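
Concretely (code point 803 = #x323 is COMBINING DOT BELOW):

   ;; Works wherever char-code-limit exceeds #x323, but yields a
   ;; two-element string for what renders as a single glyph:
   (coerce (list #\s (code-char 803)) 'string)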

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Ray Dillinger
Subject: Re: Unicode LISP??
Date: 
Message-ID: <aSS_c.12208$54.171154@typhoon.sonic.net>
Marcin 'Qrczak' Kowalczyk wrote:
> Ray Dillinger <····@sonic.net> writes:

>>1) Combining codepoints in isolation are members of the
>>    character datatype, but, like control characters and
>>    characters with buckybits in CLTL2, they aren't
>>    string-characters; you can't put them into strings as
>>    independent characters.
> 
> 
> If strings are not isomorphic to sequences of characters (whatever
> exactly "characters" mean), I predict confusion and breakage. In about
> any language which has characters as a dictinct type from strings,
> strings are sequences of characters.

Well, the consideration in CLTL was that the "character"
datatype actually represented two different things: characters
and keystrokes.  Alt-J is a keystroke.  Uppercase J is a
character. It's entirely reasonable to collect characters in
strings; but it's not reasonable to have "strings" of
keystrokes.

So CLTL had this distinction: "characters" as a datatype
included keystrokes, but only true characters (not keystrokes)
were supposed to be string-characters. CLTL2 still referred
to this, but the committee's decision was to allow buckybits,
font bits, and other stuff that could make something into a
non-string-character to exist as "implementation-defined
attributes" and to strike the specification of that behavior
from the standard.

In a grapheme-based system, a combining codepoint by itself,
similarly, is an entity you might have to work with at times,
but it isn't a true character; it doesn't make sense to stick
it into strings by itself without a base character to modify.

Anyway, it was just one of many ideas.  I actually think I
prefer the system where the language _primitives_ allow one or
more codepoints per character and enforce absolutely nothing
about which codepoints they may be.  All that would come out
in library code for the UNICODE Character Set, and with
different libraries, you could work with the UNI-21 character
set where there is one codepoint per character, or the UNI-16
character set where there is one codepoint per character and
it's restricted to sixteen bits, or the LATIN-1 character set
where there's one codepoint per character and it's restricted
to 8 bits.

				Bear
From: Bruno Haible
Subject: Re: Unicode LISP??
Date: 
Message-ID: <chihst$mb5$1@laposte.ilog.fr>
Ray Dillinger <····@sonic.net> asked:
>
> Okay...  If you were designing a LISP, from the ground up, to be
> a fully unicode-aware language that "Does Unicode Right" what would
> you do?

1) Drop the character type; use strings of length 1 to represent
   characters, like the ABC programming language did more than 10
   years ago. (A small illustration follows the list.)
   Rationale:
     - Many string operations like string-upcase make sense only for
       entire strings, not for single characters.
     - For some purposes it is wrong to iterate through the Unicode
       codepoints of a string one by one.

2) Provide access to libraries that implement Unicode-compliant
   string-upcase/downcase/titlecase, regexp searching, canonicalization.
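
To illustrate proposal 1: standard subseq already behaves this way,
so "indexing" yields a string, and the string operations then apply
uniformly (assuming an implementation with Latin-1 characters):

   (subseq "naïve" 2 3)                 ; => "ï", a string, not #\ï
   (string-upcase (subseq "naïve" 2 3)) ; => "Ï"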


          Bruno