From: Xah Lee
Subject: languages with full unicode support
Date: 
Message-ID: <1151251736.590910.36050@u72g2000cwu.googlegroups.com>
Languages with Full Unicode Support

As far as i know, Java and JavaScript are languages with full, complete
unicode support. That is, they allow names to be defined using unicode.
(the JavaScript engine used by FireFox support this)

As far as i know, here's few other lang's status:

C → No.
Python → No.
Perl → No.
Haskell → Yes by the spec, but no on existing compilers.
JavaScript → No in general. Firefox's engine do support it.
Lisps → No.
unix shells (bash)  → No. (this probably applies to all unix shells)
Java → Yes and probably beats all. However, there may be a bug in 1.5
compiler.

Also, there appears to be a bug with Java 1.5's unicode support. The
following code compiles fine in 1.4, but under 1.5 the compiler
complains about the name x1.str★.

class 方 {
    String str北 =
"北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。\n寧不知倾城与倾国。\n佳人難再得。";
    String str★="θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈
ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏
⊕⊗⊙⊚⊛∘∙ ★☆";

}

class UnicodeTest {
    public static void main(String[] arg) {
        方 x1 = new 方();
        System.out.println( x1.str北 );
        System.out.println( x1.str★ );
    }
}

If you know a lang that does full unicode support, please let me know.
Thanks.

   Xah
   ···@xahlee.org
 ∑ http://xahlee.org/

From: Frank Buss
Subject: Re: languages with full unicode support
Date: 
Message-ID: <4eeycqavp7ld$.1sthhf3vp4w84$.dlg@40tude.net>
Xah Lee wrote:

> Lisps → No.

The Common Lisp spec (CLHS) doesn't require that implementations support
Unicode characters, but it doesn't forbid it and some implementations
support it, e.g. http://clisp.cons.org/impnotes.html

-- 
Frank Buss, ··@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Mumia W.
Subject: Re: languages with full unicode support
Date: 
Message-ID: <HsBng.1164$NP4.480@newsread1.news.pas.earthlink.net>
Xah Lee wrote:
> Languages with Full Unicode Support
> 
> As far as i know, Java and JavaScript are languages with full, complete
> unicode support. That is, they allow names to be defined using unicode.
> (the JavaScript engine used by FireFox support this)
> 
> As far as i know, here's few other lang's status:
> 
> C → No.
> Python → No.
> Perl → No.

Perl supports Unicode in its core, and that includes identifier names 
using exotic characters.


> Haskell → Yes by the spec, but no on existing compilers.

Erm, isn't this an effective "No"?

> JavaScript → No in general. Firefox's engine do support it.
> Lisps → No.
> unix shells (bash)  → No. (this probably applies to all unix shells)
> Java → Yes and probably beats all. However, there may be a bug in 1.5
> compiler.
> 
> Also, there appears to be a bug with Java 1.5's unicode support. The
> following code compiles fine in 1.4, but under 1.5 the compiler
> complains about the name x1.str★.
> 
> class 方 {
>     String str北 =
> "北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。\n寧不知倾城与倾国。\n佳人難再得。";
>     String str★="θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈
> ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏
> ⊕⊗⊙⊚⊛∘∙ ★☆";
> 
> }
> 
> class UnicodeTest {
>     public static void main(String[] arg) {
>         方 x1 = new 方();
>         System.out.println( x1.str北 );
>         System.out.println( x1.str★ );
>     }
> }
> 
> If you know a lang that does full unicode support, please let me know.
> Thanks.
> 
>    Xah
>    ···@xahlee.org
>  ∑ http://xahlee.org/

Perl is coming close to having full unicode support. '★' is not an 
alphabetic or numeric character and has no place in an identifier. That 
is why both Perl and Java reject it. Let's see what Perl can do:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

package 方;
our $str北="北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。"
     . "\n寧不知倾城与倾国。\n佳人難再得。";

our $strβ = "θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈
ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏
⊕⊗⊙⊚⊛∘∙ ★☆";

sub new {
     my $class = shift;
     my $self = {
         str北 => \$str北,
         'strβ' , \$strβ,
     };
     bless ($self, $class);
}

sub str北 {
     ${ (shift)->{str北} };
}

sub strβ {
     ${ (shift)->{strβ} };
};

package Test方;

sub do {
     binmode STDOUT, 'utf8';
     my $obj方 = 方->new();
     $\ = "\n";
     print $obj方->str北();
     print '----------------';
     print $obj方->strβ();
}

Test方->do();
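
For readers who want to check the point about '★' on the Java side, here is a minimal sketch. It is not part of the original post; the class and method names are made up for illustration. It only asks the standard `java.lang.Character` methods whether a code point may start or continue a Java identifier, so the answer reflects the Unicode tables of whichever JDK runs it.

```java
// Hypothetical helper, not from the thread: asks java.lang.Character
// whether a code point may start or continue a Java identifier.
class IdentifierCharCheck {
    /** True if cp may be the first character of a Java identifier. */
    static boolean allowedAtStart(int cp) {
        return Character.isJavaIdentifierStart(cp);
    }

    /** True if cp may appear after the first character of a Java identifier. */
    static boolean allowedInIdentifier(int cp) {
        return Character.isJavaIdentifierPart(cp);
    }

    public static void main(String[] args) {
        // Code points are given numerically so the file compiles
        // regardless of the source encoding.
        int[] samples = {
            0x5317, // CJK ideograph used in the name "str北"
            0x03B2, // Greek small letter beta, used in "strβ"
            0x2605  // BLACK STAR, used in "str★"
        };
        for (int cp : samples) {
            System.out.printf("U+%04X type=%d start=%b part=%b%n",
                    cp, Character.getType(cp),
                    allowedAtStart(cp), allowedInIdentifier(cp));
        }
    }
}
```

The star falls in a symbol category rather than a letter or digit category, which is consistent with the explanation given above for why it is rejected in a name.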
From: Darren New
Subject: Re: languages with full unicode support
Date: 
Message-ID: <qgCng.16196$Z67.455@tornado.socal.rr.com>
Xah Lee wrote:
> If you know a lang that does full unicode support, please let me know.

Tcl.  You may have to modify the "source" command to get it to default 
to something other than the system encoding, but this is trivial in Tcl.

-- 
   Darren New / San Diego, CA, USA (PST)
     Native Americans used every part
     of the buffalo, including the wings.
From: OMouse
Subject: Re: languages with full unicode support
Date: 
Message-ID: <1151294206.255080.216880@m73g2000cwd.googlegroups.com>
> As far as i know, here's few other lang's status:
>
> C → No.

I think C has the wchar_t type to handle larger values. And C++ has
std::wstring. So really, the support is there.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c

I think the problem is that most C/C++ coders don't care about unicode
support and so they stick to char and std::string.
From: Oliver Bandel
Subject: Re: languages with full unicode support
Date: 
Message-ID: <1151276031.641318@elch.in-berlin.de>
こんいちわ Xah-Lee san ;-)


Xah Lee wrote:

> Languages with Full Unicode Support
> 
> As far as i know, Java and JavaScript are languages with full, complete
> unicode support. That is, they allow names to be defined using unicode.

Can you explain what you mean by the names here?


> (the JavaScript engine used by FireFox support this)
> 
> As far as i know, here's few other lang's status:
> 
> C → No.

Well, is this (only) a language issue?

On Plan-9 all things seem to be UTF-8 based,
and when you use C for programming, I would think
that C can handle this also.

But I have only read some papers about Plan-9 and have not developed on 
it.

Just an attempt to look at it from a different angle.

If someone knows more, please let us know :)


Ciao,
    Oliver
From: Oliver Wong
Subject: Re: languages with full unicode support
Date: 
Message-ID: <BbSng.15374$B91.5146@edtnps82>
"Oliver Bandel" <······@first.in-berlin.de> wrote in message 
······················@elch.in-berlin.de...
>
> Xah Lee wrote:
>
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
>
> Can you explain what you mean by the names here?

    As in variable names, function names, class names, etc.

    - Oliver 
From: Tin Gherdanarra
Subject: Re: languages with full unicode support
Date: 
Message-ID: <4gd3iiF1md4lcU1@individual.net>
Oliver Bandel wrote:
> 
> こんいちわ Xah-Lee san ;-)

Uhm, I'd guess that Xah is Chinese. Be careful
with such things in real life; Koreans might
beat you up for this. Stay alive!


> 
> 
> Xah Lee wrote:
> 
>> Languages with Full Unicode Support
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
> 
> 
> Can you explain what you mean by the names here?
> 
> 
>> (the JavaScript engine used by FireFox support this)
>>
>> As far as i know, here's few other lang's status:
>>
>> C → No.
> 
> 
> Well, is this (only) a language issue?
> 
> On Plan-9 all things seem to be UTF-8 based,
> and when you use C for programming, I would think
> that C can handle this also.
> 
> But I have only read some papers about Plan-9 and have not developed on 
> it.
> 
> Just an attempt to look at it from a different angle.
> 
> If someone knows more, please let us know :)
> 
> 
> Ciao,
>    Oliver


-- 
Lisp cannot scratch, because Lisp is liquid
From: Matthias Blume
Subject: Re: languages with full unicode support
Date: 
Message-ID: <m164iml3px.fsf@hana.uchicago.edu>
Tin Gherdanarra <···········@gmail.com> writes:

> Oliver Bandel wrote:
>> こんいちわ Xah-Lee san ;-)
>
> Uhm, I'd guess that Xah is Chinese. Be careful
> with such things in real life; Koreans might
> beat you up for this. Stay alive!

And the Japanese might beat him up, too.  For butchering their
language. :-)
From: Oliver Bandel
Subject: Re: languages with full unicode support
Date: 
Message-ID: <1151857200.499567@elch.in-berlin.de>
Matthias Blume wrote:

> Tin Gherdanarra <···········@gmail.com> writes:
> 
> 
>>Oliver Bandel wrote:
>>
>>>こんいちわ Xah-Lee san ;-)
>>
>>Uhm, I'd guess that Xah is Chinese. Be careful
>>with such things in real life; Koreans might
>>beat you up for this. Stay alive!
> 
> 
> And the Japanese might beat him up, too.  For butchering their
> language. :-)

OK, back to ISO-8859-1 :)  no one needs so many symbols,
this is enough: äöüÄÖÜß :)


Ciao,
   Oliver
From: Matthias Blume
Subject: Re: languages with full unicode support
Date: 
Message-ID: <m2hd1zzlnh.fsf@hanabi.local>
Oliver Bandel <······@first.in-berlin.de> writes:

>>>Oliver Bandel wrote:
>>>
>>>>こんいちわ Xah-Lee san ;-)
>>>
>>>Uhm, I'd guess that Xah is Chinese. Be careful
>>>with such things in real life; Koreans might
>>>beat you up for this. Stay alive!
>> And the Japanese might beat him up, too.  For butchering their
>> language. :-)
>
> OK, back to ISO-8859-1 :)  no one needs so many symbols,
> this is enough: äöüÄÖÜß :)

There are plenty of people who need such symbols (more people than
those who need ß, btw).

Matthias

PS: It should have been こんにちは.
From: Joachim Durchholz
Subject: Re: languages with full unicode support
Date: 
Message-ID: <e8d4u9$sdf$1@online.de>
Oliver Bandel wrote:
> Matthias Blume wrote:
> 
>> Tin Gherdanarra <···········@gmail.com> writes:
>>
>>
>>> Oliver Bandel wrote:
>>>
>>>> こんいちわ Xah-Lee san ;-)
>>>
>>> Uhm, I'd guess that Xah is Chinese. Be careful
>>> with such things in real life; Koreans might
>>> beat you up for this. Stay alive!
>>
>>
>> And the Japanese might beat him up, too.  For butchering their
>> language. :-)
> 
> OK, back to ISO-8859-1 :)  no one needs so many symbols,
> this is enough: äöüÄÖÜß :)

If you want äöüÄÖÜß, anybody else will want their local characters, too, 
and nothing below full Unicode will work.

Just for laughs, here's a list of non-ASCII Latin-based letters in 
Unicode (not verified for completeness):
   ÀÁÂÃÄÅÆàáâãäåæĀāĂăĄąǺǻǼǽ
   ÇçĆćĈĉĊċČč
   ĎďĐđ
   ÈÉÊËèéêëĒēĔĕĖėĘęĚě
   ĜĝĞğĠġĢģ
   ĤĥĦħ
   ÌÍÎÏìíîïĨĩĪīĬĭĮįİıĲĳ
   Ĵĵ
   Ķķĸ
   ĹĺĻļĽĿŀŁł
   Ðð
   ÑñŃńŅņŇňŉŊŋ
   ÒÓÔÕØòóôöõŌōŎŏÖŐőŒœǾǿ
   ŔŕŖŗŘř
   ŚśŜŝŞşŠšß
   ŢţŤťŦŧ
   ÜÙÚÛüùúûŨũŪūŭŮůŰűŲų
   Ŵŵ
   ÝýÿŶŷŸ
   Þþ
   ŹźŻżŽž
   ƒſ
ISO 8859-1 covers just a fraction of these, so Unicode would indeed be 
necessary to allow a program written in one country to compile in 
another one.
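
The coverage claim can be checked mechanically. The following is a minimal sketch, not something from the original post, and `Latin1Coverage` and `countEncodable` are made-up names. It asks the JDK's ISO 8859-1 encoder which characters of a string it can represent. The sample strings use numeric escapes so that the file compiles under any source encoding.

```java
// Hypothetical helper, not from the thread: counts how many characters
// of a string are representable in ISO 8859-1 (Latin-1).
class Latin1Coverage {
    /** Counts the chars of s that the ISO 8859-1 encoder can encode. */
    static int countEncodable(String s) {
        // Fully qualified names keep the sketch free of import lines.
        java.nio.charset.CharsetEncoder enc =
                java.nio.charset.StandardCharsets.ISO_8859_1.newEncoder();
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (enc.canEncode(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The seven German letters mentioned earlier in the thread.
        String german = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF";
        // A few letters from the list above: A/a with macron, L/l with stroke, S/s with caron.
        String others = "\u0100\u0101\u0141\u0142\u0160\u0161";
        System.out.println(countEncodable(german) + " of " + german.length());
        System.out.println(countEncodable(others) + " of " + others.length());
    }
}
```

Feeding the whole list above into `countEncodable` gives a rough measure of how small the ISO 8859-1 fraction is.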

Regards,
Jo
From: Pascal Bourguignon
Subject: Re: languages with full unicode support
Date: 
Message-ID: <873bdhvfi6.fsf@thalassa.informatimago.com>
Joachim Durchholz <··@durchholz.org> writes:

> Oliver Bandel wrote:
>> Matthias Blume wrote:
>> 
>>> Tin Gherdanarra <···········@gmail.com> writes:
>>>
>>>
>>>> Oliver Bandel wrote:
>>>>
>>>>> こんいちわ Xah-Lee san ;-)
>>>>
>>>> Uhm, I'd guess that Xah is Chinese. Be careful
>>>> with such things in real life; Koreans might
>>>> beat you up for this. Stay alive!
>>>
>>>
>>> And the Japanese might beat him up, too.  For butchering their
>>> language. :-)
>> OK, back to ISO-8859-1 :)  no one needs so many symbols,
>> this is enough: äöüÄÖÜß :)
>
> If you want äöüÄÖÜß, anybody else will want their local characters,
> too, and nothing below full Unicode will work.
>
> Just for laughs, here's a list of non-ASCII Latin-based letters in
> Unicode (not verified for completeness):
>   ÀÁÂÃÄÅÆàáâãäåæĀāĂăĄąǺǻǼǽ
>   ÇçĆćĈĉĊċČč
>   ĎďĐđ
>   ÈÉÊËèéêëĒēĔĕĖėĘęĚě
>   ĜĝĞğĠġĢģ
>   ĤĥĦħ
>   ÌÍÎÏìíîïĨĩĪīĬĭĮįİıĲĳ
>   Ĵĵ
>   Ķķĸ
>   ĹĺĻļĽĿŀŁł
>   Ðð
>   ÑñŃńŅņŇňŉŊŋ
>   ÒÓÔÕØòóôöõŌōŎŏÖŐőŒœǾǿ
>   ŔŕŖŗŘř
>   ŚśŜŝŞşŠšß
>   ŢţŤťŦŧ
>   ÜÙÚÛüùúûŨũŪūŭŮůŰűŲų
>   Ŵŵ
>   ÝýÿŶŷŸ
>   Þþ
>   ŹźŻżŽž
>   ƒſ
> ISO 8859-1 covers just a fraction of these, so Unicode would indeed be
> necessary to allow a program written in one country to compile in
> another one.

Indeed, far from complete:

(coerce (lschar :name "LATIN") 'string)
--> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
     ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô
     õöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩ
     ĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝ
     ŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑ
     ƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDž
     džLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹ
     ǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȢȣȤȥȦȧȨȩȪȫȬȭȮ
     ȯȰȱȲȳɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥɦɧɨɩɪɫɬɭɮɯɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾ
     ɿʀʁʂʃʄʅʆʇʈʉʊʋʌʍʎʏʐʑʒʓʔʕʖʗʘʙʚʛʜʝʞʟʠʡʢʣʤʥʦʧʨʩʪʫʬʭͣͤͥͦͧ
     ͨͩͪͫͬͭͮͯḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫ
     ḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟ
     ṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓ
     ẔẕẖẗẘẙẚẛẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊị
     ỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹⁱⁿ⒜⒝⒞⒟
     ⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏⓐⓑⓒⓓ
     ⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ✝✞✟fffiflffifflſtstABCDEFGHIJ
     KLMNOPQRSTUVWXYZabcdefghij
     klmnopqrstuvwxyz"


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

READ THIS BEFORE OPENING PACKAGE: According to certain suggested
versions of the Grand Unified Theory, the primary particles
constituting this product may decay to nothingness within the next
four hundred million years.
From: Mumia W.
Subject: Re: languages with full unicode support
Date: 
Message-ID: <yluqg.2825$cd3.1769@newsread3.news.pas.earthlink.net>
Pascal Bourguignon wrote:
> [...]
> (coerce (lschar :name "LATIN") 'string)
> --> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô
> [...]

In what programming language/interpreter is this code?
From: Tim Roberts
Subject: Re: languages with full unicode support
Date: 
Message-ID: <lib4a299f2oo11gl62fivn1tijdk73o5ip@4ax.com>
"Xah Lee" <···@xahlee.org> wrote:

>Languages with Full Unicode Support
>
>As far as i know, Java and JavaScript are languages with full, complete
>unicode support. That is, they allow names to be defined using unicode.
>(the JavaScript engine used by FireFox support this)
>
>As far as i know, here's few other lang's status:
>
>C → No.

This is implementation-defined in C.  A compiler is allowed to accept
variable names with alphabetic Unicode characters outside of ASCII.
-- 
- Tim Roberts, ····@probo.com
  Providenza & Boekelheide, Inc.
From: David Hopwood
Subject: Re: languages with full unicode support
Date: 
Message-ID: <Jltog.230285$8W1.8494@fe1.news.blueyonder.co.uk>
Tim Roberts wrote:
> "Xah Lee" <···@xahlee.org> wrote:
> 
>>Languages with Full Unicode Support
>>
>>As far as i know, Java and JavaScript are languages with full, complete
>>unicode support. That is, they allow names to be defined using unicode.
>>(the JavaScript engine used by FireFox support this)
>>
>>As far as i know, here's few other lang's status:
>>
>>C → No.
> 
> This is implementation-defined in C.  A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

It is not implementation-defined in C99 whether Unicode characters are
accepted; only how they are encoded directly in the source multibyte character
set.

Characters escaped using \uHHHH or \U00HHHHHH (H is a hex digit), and that
are in the sets of characters defined by Unicode for identifiers, are required
to be supported, and should be mangled in some consistent way by a platform's
linker. There are Unicode text editors which encode/decode \u and \U on the fly,
so you can treat this essentially like a Unicode transformation format (it
would have been nicer to require support for UTF-8, but never mind).


C99 6.4.2.1:

# 3 Each universal character name in an identifier shall designate a character
#   whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
#   annex D. 59) The initial character shall not be a universal character name
#   designating a digit. An implementation may allow multibyte characters that
#   are not part of the basic source character set to appear in identifiers;
#   which characters and their correspondence to universal character names is
#   implementation-defined.
#
# 59) On systems in which linkers cannot accept extended characters, an encoding
#     of the universal character name may be used in forming valid external
#     identifiers. For example, some otherwise unused character or sequence of
#     characters may be used to encode the \u in a universal character name.
#     Extended characters may produce a long external identifier.
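
To make the "treat it like a transformation format" idea concrete, here is a small sketch of the rewriting such an editor or preprocessor would perform. It is written in Java only for consistency with the rest of the thread. `UcnEscaper` and `toUcn` are hypothetical names, and the sketch makes no attempt to check the result against the identifier ranges in annex D. It leaves ASCII alone and rewrites every other code point as a four-digit or eight-digit universal character name.

```java
// Hypothetical helper, not from the thread: rewrites non-ASCII code points
// as C99-style universal character names (backslash-u with 4 hex digits,
// or backslash-U with 8 hex digits for code points above U+FFFF).
class UcnEscaper {
    static String toUcn(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                   // plain ASCII stays as-is
            } else if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));  // short form
            } else {
                out.append(String.format("\\U%08X", cp));  // long form
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "str" followed by U+5317, written with an escape so the file
        // compiles under any source encoding.
        System.out.println(toUcn("str\u5317"));
    }
}
```

A real tool would also need the reverse mapping and would have to reject code points that the standard does not allow in identifiers. Both are left out here.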

-- 
David Hopwood <····················@blueyonder.co.uk>
From: Joachim Durchholz
Subject: Re: languages with full unicode support
Date: 
Message-ID: <e7tj0t$n9i$2@online.de>
Tim Roberts wrote:
> "Xah Lee" <···@xahlee.org> wrote:
>> C → No.
> 
> This is implementation-defined in C.  A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

Hmm... that code would be nonportable, so C support for Unicode is 
half-baked at best.

Regards,
Jo
From: Thomas A. Russ
Subject: Re: languages with full unicode support
Date: 
Message-ID: <ymiy7vh2oc6.fsf@sevak.isi.edu>
Joachim Durchholz <··@durchholz.org> writes:

> Tim Roberts wrote:
> > "Xah Lee" <···@xahlee.org> wrote:
> >> C → No.
> > This is implementation-defined in C.  A compiler is allowed to accept
> 
> > variable names with alphabetic Unicode characters outside of ASCII.
> 
> Hmm... that code would be nonportable, so C support for Unicode is
> half-baked at best.

And that differs from C support for any given feature X in exactly what
manner?   ;)



Sorry, I couldn't resist.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Chris Uppal
Subject: Re: languages with full unicode support
Date: 
Message-ID: <44a26911$1$660$bed64819@news.gradwell.net>
Joachim Durchholz wrote:

> > This is implementation-defined in C.  A compiler is allowed to accept
> > variable names with alphabetic Unicode characters outside of ASCII.
>
> Hmm... that code would be nonportable, so C support for Unicode is
> half-baked at best.

Since the interpretation of characters which are yet to be added to
Unicode is undefined (will they be digits, "letters", operators, symbol,
punctuation.... ?), there doesn't seem to be any sane way that a language could
allow an unrestricted choice of Unicode in identifiers.  Hence, it must define
a specific allowed sub-set.  C certainly defines an allowed subset of Unicode
characters -- so I don't think you could call its Unicode support "half-baked"
(not in that respect, anyway).  A case -- not entirely convincing, IMO -- could
be made that it would be better to allow a wider range of characters.

And no, I don't think Java's approach -- where there /is no defined set of
allowed identifier characters/ -- makes any sense at all :-(
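
As a concrete illustration of the "yet to be added" case, here is a hedged Java sketch. It is not from the original post and `CodePointStatus` is a made-up name. It shows how one runtime answers the question for a given code point. A code point with no assignment in the runtime's Unicode tables is reported as `UNASSIGNED`, and the standard `Character` methods do not accept it in identifiers. The answer is tied to whichever Unicode version the JDK ships.

```java
// Hypothetical helper, not from the thread: reports how the running JDK
// classifies a code point for the purposes of identifiers.
class CodePointStatus {
    static String classify(int cp) {
        if (Character.getType(cp) == Character.UNASSIGNED) {
            return "unassigned";        // no meaning in this JDK's Unicode tables
        }
        if (Character.isJavaIdentifierStart(cp)) {
            return "identifier-start";  // may begin (and continue) an identifier
        }
        if (Character.isJavaIdentifierPart(cp)) {
            return "identifier-part";   // may only continue an identifier
        }
        return "other";                 // assigned, but not usable in identifiers
    }

    public static void main(String[] args) {
        int[] samples = { 'A', '1', 0x2605 /* BLACK STAR */, 0xFFFF /* noncharacter */ };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, classify(cp));
        }
    }
}
```

Under this scheme a newly added character becomes usable only when the runtime's tables are updated, so existing programs keep their meaning.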

    -- chris
From: David Hopwood
Subject: Java identifiers (was: languages with full unicode support)
Date: 
Message-ID: <6Uvog.490321$tc.256914@fe2.news.blueyonder.co.uk>
Note Followup-To: comp.lang.java.programmer

Chris Uppal wrote:
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.  Hence, it must define
> a specific allowed sub-set.  C certainly defines an allowed subset of Unicode
> characters -- so I don't think you could call its Unicode support "half-baked"
> (not in that respect, anyway).  A case -- not entirely convincing, IMO -- could
> be made that it would be better to allow a wider range of characters.
> 
> And no, I don't think Java's approach -- where there /is no defined set of
> allowed identifier characters/ -- makes any sense at all :-(

Java does have a defined set of allowed identifier characters. However, you
certainly have to go around the houses a bit to work out what that set is:


<http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

# An identifier is an unlimited-length sequence of Java letters and Java digits,
# the first of which must be a Java letter. An identifier cannot have the same
# spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
# (§3.10.3), or the null literal (§3.10.7).
[...]
# A "Java letter" is a character for which the method
# Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
# is a character for which the method Character.isJavaIdentifierPart(int)
# returns true.
[...]
# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

For Java 1.5.0:

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

# Character information is based on the Unicode Standard, version 4.0.

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

# A character may start a Java identifier if and only if one of the following
# conditions is true:
#
#   * isLetter(codePoint) returns true
#   * getType(codePoint) returns LETTER_NUMBER
#   * the referenced character is a currency symbol (such as "$")

[This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
General Category Sc.]

#   * the referenced character is a connecting punctuation character (such as "_").

[This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
General Category Pc.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

# A character may be part of a Java identifier if any of the following are true:
#
#   * it is a letter
#   * it is a currency symbol (such as '$')
#   * it is a connecting punctuation character (such as '_')
#   * it is a digit
#   * it is a numeric letter (such as a Roman numeral character)

[General Category Nl.]

#   * it is a combining mark

[General Category Mc (see <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

#   * it is a non-spacing mark

[General Category Mn (ditto).]

#   * isIdentifierIgnorable(codePoint) returns true for the character

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

# A character is a digit if its general category type, provided by
# getType(codePoint), is DECIMAL_DIGIT_NUMBER.

[General Category Nd.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

# The following Unicode characters are ignorable in a Java identifier or a Unicode
# identifier:
#
#   * ISO control characters that are not whitespace
#         o '\u0000' through '\u0008'
#         o '\u000E' through '\u001B'
#         o '\u007F' through '\u009F'
#   * all characters that have the FORMAT general category value

[FORMAT is General Category Cf.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

# A character is considered to be a letter if its general category type, provided
# by getType(codePoint), is any of the following:
#
#   * UPPERCASE_LETTER
#   * LOWERCASE_LETTER
#   * TITLECASE_LETTER
#   * MODIFIER_LETTER
#   * OTHER_LETTER

====

To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

  Keyword ::= one of
        abstract    continue    for           new          switch
        assert      default     if            package      synchronized
        boolean     do          goto          private      this
        break       double      implements    protected    throw
        byte        else        import        public       throws
        case        enum        instanceof    return       transient
        catch       extends     int           short        try
        char        final       interface     static       void
        class       finally     long          strictfp     volatile
        const       float       native        super        while

  Identifier        ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
  IdentifierChars   ::= JavaLetter | IdentifierChars JavaLetterOrDigit
  JavaLetter        ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
  JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
                        U+0000..0008 | U+000E..001B | U+007F..009F | Cf

where the two-letter terminals refer to General Categories in Unicode 4.0.0
(exactly).

Note that the so-called "ignorable" characters (for which
isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
treated like any other identifier character. This quote from the API spec:

# The following Unicode characters are ignorable in a Java identifier [...]

should be ignored (no pun intended). It is contradicted by:

# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

in the language spec. Unicode does have a concept of ignorable characters in
identifiers, which is probably where this documentation bug crept in.

The inclusion of U+0000 and various control characters in the set of valid
identifier characters is also a dubious decision, IMHO.

Note that I am not defending in any way the complexity of this definition; there's
clearly no excuse for it (or for the "ignorable" documentation bug). The language
spec should have been defined directly in terms of the Unicode General Categories,
and then the API in terms of the language spec. The way it is done now is
completely backwards.
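
The grammar above translates almost line for line into code. The following is a sketch under stated assumptions rather than a reference implementation. The class and method names are invented. The keyword and literal exclusion from the `Identifier` rule is omitted. The categories come from `Character.getType` of the running JDK, so they follow that JDK's Unicode version rather than exactly 4.0.0.

```java
// Hypothetical sketch of the grammar above, expressed through Unicode
// general categories. Keyword and literal exclusion is deliberately omitted.
class JavaIdentifierGrammar {
    /** JavaLetter ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc */
    static boolean isJavaLetter(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.LETTER_NUMBER:
            case Character.CURRENCY_SYMBOL:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** The so-called ignorable set: three control ranges plus category Cf. */
    static boolean isIgnorable(int cp) {
        return (cp >= 0x0000 && cp <= 0x0008)
            || (cp >= 0x000E && cp <= 0x001B)
            || (cp >= 0x007F && cp <= 0x009F)
            || Character.getType(cp) == Character.FORMAT;
    }

    /** JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc | ignorable set */
    static boolean isJavaLetterOrDigit(int cp) {
        if (isJavaLetter(cp) || isIgnorable(cp)) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.DECIMAL_DIGIT_NUMBER
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    /** IdentifierChars ::= JavaLetter | IdentifierChars JavaLetterOrDigit */
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isJavaLetter(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isJavaLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // Escapes are used so the file compiles under any source encoding:
        // U+5317 is a CJK ideograph, U+2605 is BLACK STAR.
        String[] samples = { "str\u5317", "str\u2605", "$_x1", "1abc" };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifierChars(s));
        }
    }
}
```

Comparing `isJavaLetter` with `Character.isJavaIdentifierStart` over a range of code points is a quick way to see whether a given JDK agrees with the category-based reading of the spec.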

-- 
David Hopwood <····················@blueyonder.co.uk>
From: Joachim Durchholz
Subject: Re: languages with full unicode support
Date: 
Message-ID: <e8598q$et4$1@online.de>
Chris Uppal wrote:
> Joachim Durchholz wrote:
> 
>>> This is implementation-defined in C.  A compiler is allowed to accept
>>> variable names with alphabetic Unicode characters outside of ASCII.
>> Hmm... that code would be nonportable, so C support for Unicode is
>> half-baked at best.
> 
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.

I don't think this is a problem in practice. E.g. if a language uses the 
usual definition for identifiers (first letter, then letters/digits), 
you end up with a language that changes its definition on the whims of 
the Unicode consortium, but that's less of a problem than one might 
think at first.

I'd expect two kinds of changes in character categorization: additions 
and corrections. (Any other?)

Additions are relatively unproblematic. Existing code will remain valid 
and retain its semantics. The new characters will be available for new 
programs.
There's a slight technological complication: the compiler needs to be 
able to look up the newest definition. In other words, for a compiler to 
run, it needs to be able to access http://unicode.org, or the language 
infrastructure needs a way to carry around various revisions of the 
Unicode tables and select the newest one.

Corrections are technically more problematic, but then we can rely on 
the common sense of the programmers. If the Unicode consortium 
miscategorized a character as a letter, the programmers that use that 
character set will probably know it well enough to avoid its use. It 
will probably not even occur to them that that character could be a 
letter ;-)


Actually I'm not sure that Unicode is important for long-lived code. 
Code tends to not survive very long unless it's written in English, in 
which case anything outside of strings is in 7-bit ASCII. So the 
majority of code won't ever be affected by Unicode problems - Unicode is 
more a way of lowering entry barriers.

Regards,
Jo
From: David Hopwood
Subject: Re: languages with full unicode support
Date: 
Message-ID: <UEupg.30647$7Z6.25529@fe2.news.blueyonder.co.uk>
Joachim Durchholz wrote:
> Chris Uppal wrote:
>> Joachim Durchholz wrote:
>>
>>>> This is implementation-defined in C.  A compiler is allowed to accept
>>>> variable names with alphabetic Unicode characters outside of ASCII.
>>>
>>> Hmm... that code would be nonportable, so C support for Unicode is
>>> half-baked at best.
>>
>> Since the interpretation of characters which are yet to be added to
>> Unicode is undefined (will they be digits, "letters", operators, symbol,
>> punctuation.... ?), there doesn't seem to be any sane way that a
>> language could allow an unrestricted choice of Unicode in identifiers.
> 
> I don't think this is a problem in practice. E.g. if a language uses the
> usual definition for identifiers (first letter, then letters/digits),
> you end up with a language that changes its definition on the whims of
> the Unicode consortium, but that's less of a problem than one might
> think at first.

It is not a problem at all. See the stability policies in
<http://www.unicode.org/reports/tr31/tr31-2.html>.

> Actually I'm not sure that Unicode is important for long-lived code.
> Code tends to not survive very long unless it's written in English, in
> which case anything outside of strings is in 7-bit ASCII. So the
> majority of code won't ever be affected by Unicode problems - Unicode is
> more a way of lowering entry barriers.

Unicode in identifiers has certainly been less important than some thought
it would be -- and not at all important for open source projects, for example,
which essentially have to use English to get the widest possible participation.

-- 
David Hopwood <····················@blueyonder.co.uk>
From: Dr.Ruud
Subject: Re: languages with full unicode support
Date: 
Message-ID: <e85r9h.11k.1@news.isolution.nl>
Chris Uppal wrote:

> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators,
> symbol, punctuation.... ?), there doesn't seem to be any sane way
> that a language could allow an unrestricted choice of Unicode in
> identifiers.

The Perl-code below prints:

xdigit
    22 /194522 =  0.011%  (lower:     6, upper:     6)
ascii
   128 /194522 =  0.066%  (lower:    26, upper:    26)
\d
   268 /194522 =  0.138%
digit
   268 /194522 =  0.138%
IsNumber
   612 /194522 =  0.315%
alpha
 91183 /194522 = 46.875%  (lower:  1380, upper:  1160)
alnum
 91451 /194522 = 47.013%  (lower:  1380, upper:  1160)
word
 91801 /194522 = 47.193%  (lower:  1380, upper:  1160)
graph
102330 /194522 = 52.606%  (lower:  1380, upper:  1160)
print
102349 /194522 = 52.616%  (lower:  1380, upper:  1160)
blank
    18 /194522 =  0.009%
space
    24 /194522 =  0.012%
punct
   374 /194522 =  0.192%
cntrl
  6473 /194522 =  3.328%


Especially look at 'word', the same as \w, which for ASCII is
[0-9A-Za-z_].


==8<===================
#!/usr/bin/perl
# Program-Id: unicount.pl
# Subject: show Unicode statistics

  use strict ;
  use warnings ;

  use Data::Alias ;

  binmode STDOUT, ':utf8' ;

  my @table =
  # +--Name------+---qRegexp--------+-C-+-L-+-U-+
  (
    [ 'xdigit'   , qr/[[:xdigit:]]/ , 0 , 0 , 0 ] ,
    [ 'ascii'    , qr/[[:ascii:]]/  , 0 , 0 , 0 ] ,
    [ '\\d'      , qr/\d/           , 0 , 0 , 0 ] ,
    [ 'digit'    , qr/[[:digit:]]/  , 0 , 0 , 0 ] ,
    [ 'IsNumber' , qr/\p{IsNumber}/ , 0 , 0 , 0 ] ,
    [ 'alpha'    , qr/[[:alpha:]]/  , 0 , 0 , 0 ] ,
    [ 'alnum'    , qr/[[:alnum:]]/  , 0 , 0 , 0 ] ,
    [ 'word'     , qr/[[:word:]]/   , 0 , 0 , 0 ] ,
    [ 'graph'    , qr/[[:graph:]]/  , 0 , 0 , 0 ] ,
    [ 'print'    , qr/[[:print:]]/  , 0 , 0 , 0 ] ,
    [ 'blank'    , qr/[[:blank:]]/  , 0 , 0 , 0 ] ,
    [ 'space'    , qr/[[:space:]]/  , 0 , 0 , 0 ] ,
    [ 'punct'    , qr/[[:punct:]]/  , 0 , 0 , 0 ] ,
    [ 'cntrl'    , qr/[[:cntrl:]]/  , 0 , 0 , 0 ] ,
  ) ;

  my @codepoints =
  (
     0x0000 ..  0xD7FF,
     0xE000 ..  0xFDCF,
     0xFDF0 ..  0xFFFD,
     0x10000 .. 0x1FFFD,
     0x20000 .. 0x2FFFD,
#    0x30000 .. 0x3FFFD, # etc.
  ) ;

  for my $row ( @table )
  {
    alias my ($name, $qrx, $count, $lower, $upper) = @$row ;

    printf "\n%s\n", $name ;

    my $n = 0 ;

    for ( @codepoints )
    {
      local $_ = chr ;  # int-2-char conversion
      $n++ ;

      if ( /$qrx/ )
      {
        $count++ ;
        $lower++ if / [[:lower:]] /x ;
        $upper++ if / [[:upper:]] /x ;
      }
    }

    my $show_lower_upper =
      ($lower || $upper)
      ? sprintf( "  (lower:%6d, upper:%6d)"
               , $lower
               , $upper
               )
      : '' ;

    printf "%6d /%6d =%7.3f%%%s\n"
           , $count
           , $n
           , 100 * $count / $n
           , $show_lower_upper
  }
__END__

-- 
Affijn, Ruud

"Gewoon is een tijger."
From: Dale King
Subject: Re: languages with full unicode support
Date: 
Message-ID: <6_adneQRdcb3YzbZnZ2dnUVZ_vGdnZ2d@insightbb.com>
Tim Roberts wrote:
> "Xah Lee" <···@xahlee.org> wrote:
> 
>> Languages with Full Unicode Support
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
>> (the JavaScript engine used by FireFox support this)
>>
>> As far as i know, here's few other lang's status:
>>
>> C → No.
> 
> This is implementation-defined in C.  A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

I don't think it is implementation defined. I believe it is actually 
required by the spec. The trouble is that so few compilers actually 
comply with the spec. A few years ago I asked for someone to actually 
point to a fully compliant compiler and no one could.

-- 
  Dale King
From: Tim Roberts
Subject: Re: languages with full unicode support
Date: 
Message-ID: <212pa2tnd3p87dvki0o34qgualu64d9hqc@4ax.com>
Dale King <·········@gmail.com> wrote:
>Tim Roberts wrote:
>> "Xah Lee" <···@xahlee.org> wrote:
>> 
>>> Languages with Full Unicode Support
>>>
>>> As far as i know, Java and JavaScript are languages with full, complete
>>> unicode support. That is, they allow names to be defined using unicode.
>>> (the JavaScript engine used by FireFox support this)
>> 
>> This is implementation-defined in C.  A compiler is allowed to accept
>> variable names with alphabetic Unicode characters outside of ASCII.
>
>I don't think it is implementation defined. I believe it is actually 
>required by the spec.

C99 does have a list of Unicode codepoints that are required to be accepted
in identifiers, although implementations are free to accept other
characters as well.  For example, few people realize that Visual C++
accepts the dollar sign $ in an identifier.
-- 
- Tim Roberts, ····@probo.com
  Providenza & Boekelheide, Inc.