embedding Unicode escapes in strings

From: Jerry Boetje
Subject: embedding Unicode escapes in strings
Date: Tue, 04 Oct 2005 14:13:31 +0000
Message-ID: <1128435211.785716.79920@g14g2000cwa.googlegroups.com>

Our development work on CLforJava (see ILC 2005 proceedings) is at a
point where we are looking into how to embed escaped Unicode characters
in strings. In the current standard, the only escape character in
strings is '\', used to escape '"' and '\' itself. We have a couple of
ideas for extending the escape mechanism to handle Unicode 4 characters.

In Unicode documentation, a character can be represented as U+ followed
by the hex representation of its code point. In CLforJava we already
have Reader support in the form of #\U (followed by 4 hex digits) or
#\U+ (followed by 4-6 hex digits). We have two ideas for extending this
into strings:

1. Use '\' as the escape character as it is now. But if the character
following the '\' is U (or U+) it will be followed by 4 or 6 hex digits
respectively. Since U (or u) needs no escape, this would have little
effect on existing code.

2. Add the '#' character as a second escape character that, for now,
supports only #U and #U+ followed by 4 or 6 digits as above. This would
require escaping the '#' character (\#) to include a literal '#' in a
string. This has a greater probability of breaking existing code. On the
other hand, the definition of '\' is not changed.
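
To make option 1 concrete, here is a minimal sketch of how such a
decoder might behave. It is written in Python purely for illustration
and is not CLforJava code. The name decode_option1 is hypothetical, and
the sketch assumes one particular reading of the proposal: \U takes
exactly 4 hex digits, \U+ takes exactly 6, either case of the letter is
accepted, and any other backslash just escapes the next character as it
does today.

```python
import re

# Assumed reading of option 1 (not confirmed by the proposal itself):
#   \U+xxxxxx  -> code point given by exactly 6 hex digits
#   \Uxxxx     -> code point given by exactly 4 hex digits
#   \c         -> the character c itself (the existing single escape)
_ESCAPE = re.compile(
    r"\\[Uu]\+([0-9A-Fa-f]{6})"   # \U+ followed by 6 hex digits
    r"|\\[Uu]([0-9A-Fa-f]{4})"    # \U followed by 4 hex digits
    r"|\\(.)",                    # ordinary escape such as \\ or \"
    re.DOTALL,
)


def decode_option1(text):
    """Expand the hypothetical \\U and \\U+ escapes in text."""
    def replace(match):
        six, four, single = match.groups()
        if six is not None:
            # chr() raises ValueError above U+10FFFF, which seems a
            # reasonable way to reject out-of-range code points.
            return chr(int(six, 16))
        if four is not None:
            return chr(int(four, 16))
        # Not a Unicode escape: keep the escaped character unchanged,
        # so an existing "\U" with no hex digits after it still reads as "U".
        return single
    return _ESCAPE.sub(replace, text)
```

Under these assumptions, decode_option1(r"caf\U00E9") gives "café", and
an existing string containing \Up still reads as "Up". The one case that
would change meaning is an existing \U that happens to be followed by
four hex digits, which is the "little effect on existing code" mentioned
above.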

Comments and suggestions, please! Thanks.

From: Adam Warner
Subject: Re: embedding Unicode escapes in strings
Date: Tue, 04 Oct 2005 23:10:20 +0000
Message-ID: <pan.2005.10.04.23.10.18.378821@consulting.net.nz>

On Tue, 04 Oct 2005 07:13:31 -0700, Jerry Boetje wrote:

> 1. Use '\' as the escape character as it is now. But if the character
> following the '\' is U (or U+) it will be followed by 4 or 6 hex digits
> respectively. Since U (or u) needs no escape, this would have little
> effect on existing code.

I'd suggest the same notation as Java 5.0+:
<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>

   For text input, the Java 2 SDK provides a code point input method which
   accepts strings of the form "\Uxxxxxx", where the uppercase "U"
   indicates that the escape sequence contains six hexadecimal digits,
   thus allowing for supplementary characters. A lowercase "u" indicates
   the original form of the escape sequences, "\uxxxx". You can find this
   input method and its documentation in the directory
   demo/jfc/CodePointIM of the J2SDK.
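
A rough sketch of the two forms described in that passage, again in
Python only for illustration (the function name is hypothetical and this
is not the J2SDK input method itself). It assumes the letter case alone
selects the digit count: lowercase u takes exactly four hex digits,
uppercase U exactly six.

```python
import re

# Assumed forms, following the passage quoted above:
#   \Uxxxxxx  -> six hex digits, can reach supplementary characters
#   \uxxxx    -> four hex digits, the original form
_CODE_POINT_ESCAPE = re.compile(
    r"\\U([0-9A-Fa-f]{6})|\\u([0-9A-Fa-f]{4})"
)


def decode_code_point_escapes(text):
    """Replace \\uxxxx and \\Uxxxxxx sequences with their characters."""
    def replace(match):
        six, four = match.groups()
        digits = six if six is not None else four
        return chr(int(digits, 16))
    return _CODE_POINT_ESCAPE.sub(replace, text)
```

With this convention, \u00e9 decodes to "é" and \U01D11E to the single
character U+1D11E, while \U followed by only four hex digits is left
untouched because the uppercase form requires six.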

Regards,
Adam

From: Jerry Boetje
Subject: Re: embedding Unicode escapes in strings
Date: Wed, 05 Oct 2005 14:02:37 +0000
Message-ID: <1128520957.712680.76890@o13g2000cwo.googlegroups.com>

An excellent suggestion. Thanks very much.