From: Adam Warner
Subject: Unicode and ANSI Common Lisp
Date: 
Message-ID: <pan.2004.12.15.23.19.05.484352@consulting.net.nz>
Hi all,

I hope you've been following the interesting thread entitled "CLisp case
sensitivity". There is debate over how to count Unicode characters, which
impacts upon the length of a string, individual character access and the
algorithms required to return these values.

This is a brilliant summary of the choices available to implementors:
<http://www.unicode.org/faq/char_combmark.html#7>

There appears to be growing consensus that #2, implementing strings as
sequences of Unicode code points, is the internal format that best
corresponds with the ANSI Common Lisp standard. Anything less exposes the
internal encoding of code points. Anything more (grapheme clusters) is
better left to a higher layer above ANSI Common Lisp.

What I hope people appreciate is that this really matters. Agreeing upon
how characters are counted is the only way for LENGTH to return the same
values across implementations for the same external strings. And this has
profound implications for how CHAR and SETF CHAR operate. It essentially
imposes a minimum internal format upon implementations: strings as
arrays of at least 21-bit values.
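
To make this concrete, here is a minimal sketch, assuming an
implementation whose characters cover the full 21-bit code space (SBCL
and CLISP qualify). U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP:

   ;; One supplementary code point: LENGTH counts a single character
   ;; and CHAR-CODE returns the code point itself.
   (let ((s (string (code-char #x1D11E))))
     (values (length s)               ; => 1, not two UTF-16 code units
             (char-code (char s 0)))) ; => 119070 = #x1D11E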

In discussing UTF-16 Peter Seibel commented:

   Yes, I'm with you so far. That just means that LENGTH has to be
   implemented in a smarter way--it has to scan the array of code-points
   looking for surrogate pairs in order to determine how many characters
   are in the string. (That Java's String.length() method doesn't do this
   will no doubt cause no end of problems down the line.)

This isn't the biggest issue. The biggest is that ANSI Common Lisp strings
are mutable and there is no way to insert a surrogate pair within a single
16-bit position. Java has immutable strings that already have to be
converted to mutable int arrays to modify code points.
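
A sketch of the mutation in question, again assuming full 21-bit
characters (U+10400 is a supplementary character): replacing one element
with a supplementary character is a single SETF CHAR, something a raw
16-bit-unit representation cannot do in place because the new character
needs two units.

   (let ((s (make-string 3 :initial-element #\a :element-type 'character)))
     (setf (char s 1) (code-char #x10400)) ; supplementary character
     (values (length s)                    ; => 3, unchanged
             (char-code (char s 1))))      ; => 66560 = #x10400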

A big issue with grapheme clusters is that CHAR-CODE-LIMIT is essentially
unbounded: you can't put a sensible number on it. You could try to limit
it to 63 bits (three 21-bit code points), but there's still the possibility
that someone comes up with a sensible grapheme cluster that exceeds a
sequence of three code points.

Agreement is likely that a CHAR-CODE-LIMIT of #x110000 (2^16 + 2^20) is
the most sensible way to implement Unicode 4.0.1+ in ANSI Common Lisp. As
the UTF-16 encoding cuts off the possibility of any higher code point
values (surrogate pairs only provide for a maximum of 2^20 extra values),
this limit should be sufficient for generations.

The ultimate implication is that there are only two conforming ANSI
Common Lisp implementations with respect to Unicode 4.0.1+: CLISP and
SBCL. The others return incorrect values for LENGTH, CHAR, CHAR-CODE,
etc. for particular /external/ Unicode character sequences.

Another implication is that an ANSI Common Lisp with 16-bit characters
is only conforming with respect to Unicode 3.0. Unicode 3.1 assigned the
first supplementary characters, which cannot be mutated correctly within
a 16-bit character Lisp implementation and cannot be created via a code
point argument supplied to CODE-CHAR.
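
Both points are easy to probe from the REPL: a full-Unicode
implementation reports a CHAR-CODE-LIMIT of at least #x110000 and yields
a character, while a 16-bit-character implementation reports a smaller
limit and CODE-CHAR cannot produce a supplementary character.

   (values char-code-limit       ; >= #x110000 (1114112) on CLISP and SBCL
           (code-char #x10000))  ; a character on a full-Unicode implementation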

Regards,
Adam

From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Unicode and ANSI Common Lisp
Date: 
Message-ID: <87hdmnrneb.fsf@qrnik.zagroda>
Adam Warner <······@consulting.net.nz> writes:

> There appears to be growing consensus that #2, implementing strings
> as sequences of Unicode code points, is the internal format that
> best corresponds with the ANSI Common Lisp standard.

I wholeheartedly agree. But I don't see a consensus: people either
use UTF-16 or claim that grapheme clusters would be appropriate
(advocacy of the second option is generally not backed by a practical
implementation). I mean languages in general, not just Lisp.

Having #x110000 characters leads to the question of a more efficient
string representation than UTF-32; most strings in practice are ASCII.
Clisp uses UCS-2 and ISO-8859-1 for strings which happen to fit.
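
A hypothetical sketch of that kind of choice (not CLISP's actual code):
pick the narrowest element width that can represent every code point in
the string.

   (defun narrowest-element-width (string)
     "Bits per element needed to store STRING without loss."
     (let ((max (reduce #'max string :key #'char-code :initial-value 0)))
       (cond ((< max #x100)   8)     ; fits ISO-8859-1
             ((< max #x10000) 16)    ; fits UCS-2, no supplementary chars
             (t               32)))) ; needs full code points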

This is complicated by the fact that strings are mutable and thus can
be upgraded to a wider representation. You have to either introduce an
indirection to all strings, or have a way to resize an object almost
in place by inserting a forwarding pointer by some GC-related magic.
I think clisp does the latter.

Should the character code space leave a hole for isolated surrogates?
They have no place as abstract characters and they are not
serializable as UTF-8/16/32. OTOH rejecting them probably doesn't buy
anything concrete in practice, beyond guaranteeing that arbitrary
strings can be serialized to UTF-8/16/32.
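
As a sketch of what the hole amounts to at the Lisp level (a design
question, not any particular implementation's behaviour): code points
would be legal only if they are Unicode scalar values.

   (defun unicode-scalar-value-p (code)
     "True for code points that UTF-8/16/32 can serialize on their own."
     (and (< code #x110000)
          (not (<= #xD800 code #xDFFF))))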

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Adam Warner
Subject: Re: Unicode and ANSI Common Lisp
Date: 
Message-ID: <pan.2004.12.16.03.12.56.663882@consulting.net.nz>
Hi Marcin 'Qrczak' Kowalczyk,

> Having #x110000 characters leads to the question of a more efficient
> string representation than UTF-32; most strings in practice are ASCII.
> Clisp uses UCS-2 and ISO-8859-1 for strings which happen to fit.
> 
> This is complicated by the fact that strings are mutable and thus can
> be upgraded to a wider representation. You have to either introduce an
> indirection to all strings, or have a way to resize an object almost
> in place by inserting a forwarding pointer by some GC-related magic.
> I think clisp does the latter.

Thanks for this explanation of how a more complicated design can permit
a more compact internal representation than UTF-32!

> Should the character code space leave a hole for isolated surrogates?
> They have no place as abstract characters and they are not
> serializable as UTF-8/16/32. OTOH rejecting them probably doesn't buy
> anything concrete in practice, beyond guaranteeing that arbitrary
> strings can be serialized to UTF-8/16/32.

There's a discussion in 2.7 of the Unicode standard:
<http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf>

   ... There are a number of techniques for dealing with an isolated
   surrogate, such as omitting it, or converting it into U+FFFD
   replacement character to produce well-formed UTF-16, or simply halting
   the processing of the string with an error. For more information on
   this topic, see Unicode Technical Report #22, Character Mapping Markup
   Language (CharMapML).

I suspect just passing them through is not a good option, as it permits
back-channel information transfer (like the issue with overlong UTF-8
sequences). If you don't omit isolated surrogates, or convert them to
something valid and visible, someone could send a separate message
encoded within the isolated surrogates.
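
A minimal sketch of the "convert to U+FFFD" option mentioned in that
passage, for an implementation whose strings can hold surrogate code
points at all (the function name is mine):

   (defun scrub-isolated-surrogates (string)
     "Replace any surrogate code point with U+FFFD REPLACEMENT CHARACTER."
     (map 'string
          (lambda (ch)
            (if (<= #xD800 (char-code ch) #xDFFF)
                (code-char #xFFFD)
                ch))
          string))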

Regards,
Adam
From: Cameron MacKinnon
Subject: Re: Unicode and ANSI Common Lisp
Date: 
Message-ID: <jLGdndWeuvm25lzcRVn-rg@golden.net>
Adam Warner wrote:
> Hi all,
> 
> I hope you've been following the interesting thread entitled "CLisp case
> sensitivity". There is debate over how to count Unicode characters, which
> impacts upon the length of a string, individual character access and the
> algorithms required to return these values.
> 
> This is a brilliant summary of the choices available to implementors:
> <http://www.unicode.org/faq/char_combmark.html#7>
> 
> There appears to be growing consensus that #2, implementing strings as
> sequences of Unicode code points, is the internal format that best
> corresponds with the ANSI Common Lisp standard. Anything less exposes the
> internal encoding of code points. Anything more (grapheme clusters) is
> better left to a higher layer above ANSI Common Lisp.

I don't know much about Unicode and wide character issues, so if I 
appear to be grossly misinformed...

Required are:
	- conversion functions, minimally bytes/octets <=> unistrings
	- length-in-codepoints, length-in-graphemes
	- codepoint and grapheme access/mutator functions

For any given sane internal unistring format, the above functions will 
run in a mix of O(1), O(log n) and O(n).
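
One way to read the list above as a CL-level protocol, purely to pin
down the shape of the interface (all names are hypothetical):

   (defgeneric octets-to-unistring (octets encoding))
   (defgeneric unistring-to-octets (unistring encoding))
   (defgeneric length-in-codepoints (unistring))   ; O(1) over a flat array
   (defgeneric length-in-graphemes (unistring))    ; typically O(n)
   (defgeneric codepoint-ref (unistring index))
   (defgeneric (setf codepoint-ref) (new-char unistring index))
   (defgeneric grapheme-ref (unistring index))
   (defgeneric (setf grapheme-ref) (new-grapheme unistring index))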

For any given application's mix of data and usage, some internal 
representation will be optimal.

If you put a bright line around a single abstraction level (such as 
codepoints), CL implementers are likely to write code that runs efficiently 
at that level of abstraction. This may not be suitable as a base for a 
grapheme based application. Will the writer of a grapheme abstraction 
build it on top of the codepoint based layer, paying double at 
conversion time, or will he be tempted to build directly on top of raw 
bytes?

By saying "Anything more (grapheme clusters) is better left to a higher 
layer above ANSI Common Lisp", you are dooming programmers of grapheme 
based applications to the bleak choice of writing their own 
abstractions, or searching among a half-dozen incompatible libraries 
that all do more or less the same thing with different licensing terms. 
Maybe this is reasonable, if the vast majority of unicode programs don't 
deal with graphemes.

The ideal unicode implementation would, I presume, sport an API about 
which coders would say "These people really understood the issues; 
they've provided exactly what I need" - regardless of whether they 
needed a codepoint or grapheme based abstraction.

This implies multiple internal representations, I guess. It doesn't 
sound fun or elegant in its simplicity, but if the end result is a 
computer language more capable of processing/comprehending the structure 
of written human languages, progress will have been made.

> It essentially
> imposes a minimum internal format upon implementations: strings as
> arrays of at least 21-bit values.

Bleh. DO consider the CPU cache pressure that widening all of your 
characters imposes. You can mispredict a lot of branches in the time it 
takes to fetch a cache miss. Hardware designers' efforts at doubling our 
bus widths will do us no good if the character set gurus insist on 
halving the average amount of information encoded in each bit.
From: Adam Warner
Subject: Re: Unicode and ANSI Common Lisp
Date: 
Message-ID: <pan.2004.12.16.13.34.21.893923@consulting.net.nz>
Hi Cameron MacKinnon,

> If you put a bright line around a single abstraction level (such as 
> codepoints), CL implementers are likely to write code that runs efficiently 
> at that level of abstraction. This may not be suitable as a base for a 
> grapheme based application. Will the writer of a grapheme abstraction 
> build it on top of the codepoint based layer, paying double at 
> conversion time, or will he be tempted to build directly on top of raw 
> bytes?

A grapheme is a sequence of particular code points, a.k.a. a string at
this level of abstraction.

A grapheme abstraction will be a sequence of strings, a.k.a. a vector of
strings.

For random access to particular grapheme clusters this will be no less
efficient than if implementors did it themselves. Graphemes are by their
nature variable-width objects, so a sequence of pointers to variable-width
objects is the best you can do.

For added speed you could intern graphemes like symbols and then perform
EQ comparisons.
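
A minimal sketch of that representation, with a purely illustrative
intern table: each grapheme cluster is a string, a grapheme-level
sequence is a vector of them, and interning makes EQ comparison work.

   (defvar *graphemes* (make-hash-table :test #'equal))

   (defun intern-grapheme (cluster)
     "CLUSTER is a string of code points forming one grapheme cluster."
     (or (gethash cluster *graphemes*)
         (setf (gethash cluster *graphemes*) cluster)))

   (defun grapheme-vector (clusters)
     "CLUSTERS is a list of grapheme-cluster strings."
     (map 'vector #'intern-grapheme clusters))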

> By saying "Anything more (grapheme clusters) is better left to a higher 
> layer above ANSI Common Lisp", you are dooming programmers of grapheme 
> based applications to the bleak choice of writing their own 
> abstractions, or searching among a half-dozen incompatible libraries 
> that all do more or less the same thing with different licensing terms. 
> Maybe this is reasonable, if the vast majority of unicode programs don't 
> deal with graphemes.

See above. Common Lisp has good support for sequences of strings.

>> It essentially
>> imposes a minimum internal format upon implementations: strings as
>> arrays of at least 21-bit values.
> 
> Bleh. DO consider the CPU cache pressure that widening all of your 
> characters imposes. You can mispredict a lot of branches in the time it 
> takes to fetch a cache miss. Hardware designers' efforts at doubling our 
> bus widths will do us no good if the character set gurus insist on 
> halving the average amount of information encoded in each bit.

It will involve fewer cache misses than grapheme clusters in the
soon-to-be-common situation of 64-bit pointers. In the 32-bit world,
implementing strings as pointers to character objects would be a
compelling alternative. But throwing away twice as much space with
64-bit pointers as the base character abstraction may be a hard sell.
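
Back-of-the-envelope figures for that comparison (ignoring object
headers and the pointed-to objects themselves):

   (let ((chars 1000))
     (list :utf-32-array     (* 4 chars)    ; 4000 bytes of characters
           :pointer-per-char (* 8 chars)))  ; 8000 bytes of pointers alone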

Regards,
Adam