From: Matt Curtin
Subject: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <xlxiudcizfn.fsf@gold.cis.ohio-state.edu>
[Please honor followups.  I don't *really* want to start a flamewar.
But this should be entertaining for the readers of these groups...]

I've been thinking a lot about languages lately.  Languages are cool.
Lest you get the impression that I mean programming languages, which
can also be cool, I'm talking about natural languages.  English,
German, Russian ... that sort of thing.

Given that Perl was started by a linguist, namely Larry, and has also
been under the influence of another, namely Tom, it seemed to me that
if any language would be friendly to languages other than boring
English (and Esperanto!) with non-accented, 7-bit ASCII characters, it
would be Perl.

Trying to feed things like character codes (such as, for example,
0x81) to parsers has been known to produce results that one might
generously classify as "suboptimal".  But if there's any parser that
can handle such things, thought I, it would be written by Larry,
Mr. Weird Parsing himself.  And if there is a language anywhere for
which a parser to handle such things has been written, it would be
Perl.

Since we have an English module that allows one to use the English
equivalent of such bizarre things that we might classify as "native
Perl", it seemed only fair that we would have other modules that would
allow one to use keywords and whatnot based on languages other than
English, in their native writing systems.  (I recognize the difficulty 
of multibyte encoding schemes, so let's pretend for the time being
that this isn't a problem.)

I mean, really, wouldn't it be cool to be able to do something like

    #!/usr/bin/perl

    use Français; # minor bootstrapping problem. :-)

    écrivez "Bonjour, monde!\n";

I certainly think so.

We might write a small program to see if it can be done:

    #!/usr/local/bin/perl

    écrivez "Bonjour, monde!\n";

    sub écrivez ($) {
      my $m = shift;
      print $m;
    }

In so doing, we're likely to make this discovery:

    $ ./foo.pl
    Unrecognized character \351 at ./foo.pl line 3.

Well, that's no fun.

How about Python?

    #!/usr/local/bin/python 

    def écrivez (m):
        print m;

    écrivez("Bonjour, monde!");

hmm...

    $ ./foo.py 
      File "./foo.py", line 5
        def écrivez (m):
        ^
    SyntaxError: invalid syntax

What a bore!

Let's see ... what other (programming) languages might support this
sort of thing?  How about Lisp?  XEmacs Lisp, at that?

    (defun écrivez (m)
      "Écrivez un message."
      (message m))

Sure enough, evaluating the expression `(écrivez "Bonjour, monde!")'
will write `Bonjour, monde!' in the minibuffer completely sans
complaint.

This is too much.  The implications are astounding.  Newfangled vi
implementations that use Perl as their customization language are
actually less functional than XEmacs Lisp!  Lisp can do something that 
Perl can't!

Pray, what do we do?  Larry, will you start using Lisp?  Tom, will you
start hacking Lisp to customize your Emacs sessions?  Is this the
beginning of the end?

Gleefully yours,
Just Another (Lisp|Perl|Java|Unix|.*)+ Hacker

-- 
Matt Curtin ········@interhack.net http://www.interhack.net/people/cmcurtin/

From: Bruno Haible
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <79p88l$8h2$1@news.u-bordeaux.fr>
Matt Curtin  <········@interhack.net> wrote:
>
> Let's see ... what other (programming) languages might support this
> sort of thing?  How about Lisp?  XEmacs Lisp, at that?
>
>    (defun écrivez (m)
>      "Écrivez un message."
>      (message m))
>
> Sure enough, evaluating the expression `(écrivez "Bonjour, monde!")'
> will write `Bonjour, monde!' in the minibuffer completely sans
> complaint.

CLISP Common Lisp has no problems with this either:

[1]> (defun écrivez (m)
       "Écrivez un message."
       (write-line m))
ÉCRIVEZ
[2]> (écrivez "Bonjour, tout le monde!")
Bonjour, tout le monde!
"Bonjour, tout le monde!"
[3]> 
À bientôt!

The next release of CLISP will not only be 8-bit clean, it will also support
Unicode 16-bit characters.

              Bruno                              http://clisp.cons.org/
From: Marco Antoniotti
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <lwpv7jr9bd.fsf@copernico.parades.rm.cnr.it>
······@clisp.cons.org (Bruno Haible) writes:

> Matt Curtin  <········@interhack.net> wrote:
> >
> > Let's see ... what other (programming) languages might support this
> > sort of thing?  How about Lisp?  XEmacs Lisp, at that?
> >
> >    (defun écrivez (m)
> >      "Écrivez un message."
> >      (message m))
> >
> > Sure enough, evaluating the expression `(écrivez "Bonjour, monde!")'
> > will write `Bonjour, monde!' in the minibuffer completely sans
> > complaint.
> 
> CLISP Common Lisp has no problems with this either:
> 
> [1]> (defun écrivez (m)
>        "Écrivez un message."
>        (write-line m))
> ÉCRIVEZ
> [2]> (écrivez "Bonjour, tout le monde!")
> Bonjour, tout le monde!
> "Bonjour, tout le monde!"
> [3]> 
> À bientôt!
> 
> The next release of CLISP will not only be 8-bit clean, it will also support
> Unicode 16-bit characters.
> 


This is a good thing, although only relatively so.

If you are going to release a Unicode version (or UTF8) of CLISP, it'd
probably be a good thing to post some document that describes the
addendum so that other implementors can take advantage of it.

On top of that, since it seems there will be another round of the ANSI
CL committee, this could be meat for that venue.

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 10 03 17, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it
From: Johannes Beck
Subject: Re: 8-bit input (or, "Perl attacks on non-English language  communities!")
Date: 
Message-ID: <36C058B7.F6A44324@informatik.uni-wuerzburg.de>
> > Matt Curtin  <········@interhack.net> wrote:
> > >
> > > Let's see ... what other (programming) languages might support this
> > > sort of thing?  How about Lisp?  XEmacs Lisp, at that?
> > >
> > >    (defun écrivez (m)
> > >      "Écrivez un message."
> > >      (message m))
> > >
> > > Sure enough, evaluating the expression `(écrivez "Bonjour, monde!")'
> > > will write `Bonjour, monde!' in the minibuffer completely sans
> > > complaint.
> >
> > CLISP Common Lisp has no problems with this either:
> >
> > [1]> (defun écrivez (m)
> >        "Écrivez un message."
> >        (write-line m))
> > ÉCRIVEZ
> > [2]> (écrivez "Bonjour, tout le monde!")
> > Bonjour, tout le monde!
> > "Bonjour, tout le monde!"
> > [3]>
> > À bientôt!
> >
> > The next release of CLISP will not only be 8-bit clean, it will also support
> > Unicode 16-bit characters.
> >
> 
> This is a good thing, although only relatively so.
> 
> If you are going to release a Unicode version (or UTF8) of CLISP, it'd
> probably be a good thing to post some document that describes the
> addendum so that other implementors can take advantage of it.
> 
> On top of that, since it seems there will be another round of the ANSI
> CL committee, this could be meat for that venue.

It would be a very nice feature to have several CL-Functions localized,
so you don't have to invent your own routines to do this. I'd like to
mention
- format (with date, time, floats)
- char-upcase etc. (e.g. Allegro is wrong when German special chars
  are involved)
- Daylight saving time & time zones

Since every serious OS supports localization, Lisp implementations
should be forced to use it.

Johannes Beck

--
Johannes Beck   ····@informatik.uni-wuerzburg.de
                http://www-info6.informatik.uni-wuerzburg.de/~beck/
                Tel.: +49 931 312198
		Fax.: +49 931 7056120
                
PGP Public Key available by ·············@informatik.uni-wuerzburg.de
From: Thomas A. Russ
Subject: Re: 8-bit input (or, "Perl attacks on non-English language   communities!")
Date: 
Message-ID: <ymipv7j9nlz.fsf@sevak.isi.edu>
Johannes Beck <····@informatik.uni-wuerzburg.de> writes:

> It would be a very nice feature to have several CL-Functions localized,
> so you don't have to invent your own routines to do this. I'd like to
> mention
> - format (with date, time, floats)

(format t "Let me be the ~:R ~R to mention format with ~~R!" 1 1)
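
For anyone without a REPL handy, that call should print (assuming a
standard FORMAT, where ~:R is an ordinal, ~R a cardinal, and ~~ a
literal tilde):

    Let me be the first one to mention format with ~R!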



-- 
Thomas A. Russ,  USC/Information Sciences Institute          ···@isi.edu    
From: Kent M Pitman
Subject: Re: 8-bit input (or, "Perl attacks on non-English language   communities!")
Date: 
Message-ID: <sfw90e7oskz.fsf@world.std.com>
···@sevak.isi.edu (Thomas A. Russ) writes:

> Johannes Beck <····@informatik.uni-wuerzburg.de> writes:
> 
> > It would be a very nice feature to have several CL-Functions localized,
> > so you don't have to invent your own routines to do this. I'd like to
> > mention
> > - format (with date, time, floats)
> 
> (format t "Let me be the ~:R ~R to mention format with ~~R!" 1 1)

(format t "~@R'll ~:R that.  ~
           And ~@R want to be ~0@*~:R to take a ~:R out to mention ~~@R!"
           1 2 1)
From: Erik Naggum
Subject: Re: 8-bit input (or, "Perl attacks on non-English language   communities!")
Date: 
Message-ID: <3127698009640562@naggum.no>
* Johannes Beck <····@informatik.uni-wuerzburg.de>
| It would be a very nice feature to have several CL-Functions localized,
| so you don't have to invent your own routines to do this.

  localization and internationalization have done more to destroy what was
  left of intercultural communication and respect for cultural needs than
  anything else in the entire computer history.  the people who are into
  these things should not be allowed to work with them for the same reason
  people who want power should be the last people to get it.

| I'd like to mention
| - format (with date, time, floats)

  date and time should be in ISO 8601.  people who want something else can
  write their own printers (and parsers).  nobody agrees to date-related
  representation even within the same office if left to themselves, let
  alone in a whole country or culture.  it's _much_ worse to put something
  in a standard that people will compete with than not to put something in
  a standard.

  I have a printer and reader for date and time that goes like this:

(describe (get-full-time tz:pacific))
#[1999-02-09 13:16:33.692-08:00] is an instance of #<standard-class full-time>:
 The following slots have :instance allocation:
  sec      3127583793
  msec     692
  zone     #<timezone pacific loaded from "US/Pacific" @ #x202b1e02>
  format   nil

[1970-01-01]
=> 2208988800

(setf *parse-time-default* '(1999 02 09))
[21:19]
=> 3127583940

  a friend of mine commented that using [...] (returns a universal-time)
  and #[...] (returns a full-time object) for this was kind of a luxury
  syntax, but the application this was written for reads and writes dates
  and times millions of times a day.

(format nil "~/ISO:8601/" [21:19])
=> "1999-02-09 21:19:00.000"

(format nil "~920:/ISO:8601/" [21:19])
=> "21:19"

(let ((*default-timezone* tz:oslo))
  (describe (read-from-string "#[22:19]")))
=> #[1999-02-09 22:19:00.000+01:00] is an instance of #<standard-class full-time>:
 The following slots have :instance allocation:
  sec      3127583940
  msec     0
  zone     -1
  format   920
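
  incidentally, the ~/ISO:8601/ directive above just makes FORMAT call a
  function named 8601 in the ISO package with the standard (stream
  argument colon-p at-sign-p &rest parameters) arguments.  a minimal
  sketch of such a printer (not the code used above; it ignores time
  zones and the width parameter and hardwires the milliseconds) could
  look like this:

(defpackage "ISO" (:use "COMMON-LISP") (:export "8601"))

(defun iso:|8601| (stream universal-time colon-p at-sign-p &rest parameters)
  ;; print UNIVERSAL-TIME as an ISO 8601 date and time, UTC only.
  (declare (ignore colon-p at-sign-p parameters))
  (multiple-value-bind (sec min hour day month year)
      (decode-universal-time universal-time 0)
    (format stream "~4,'0D-~2,'0D-~2,'0D ~2,'0D:~2,'0D:~2,'0D.000"
            year month day hour min sec)))

(format nil "~/ISO:8601/" 3127583940)
=> "1999-02-09 21:19:00.000"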

  now, floating-point values.  I _completely_ fail to see the charm of the
  comma as a decimal point, and have never used it.  (I remember how
  grossly unfair I thought it was to be reprimanded for refusing to succumb
  to the ambiguity of the comma in third grade.  it was just too stupid to
  use the same symbol both inside and between numbers, so I used a dot.)
  if you want this abomination, it will be output-only, upon special
  request.  none of the pervasive default crap that localization in C uses.
  e.g., a version of Emacs failed mysteriously on Digital Unix systems and
  the maintainers just couldn't figure out why, until the person in
  question admitted to having used Digital Unix's fledgling "localization"
  support.  of course, Emacs Lisp now read floating point numbers in the
  "C" locale to avoid this braindamage.  another pet peeve is that "ls" has
  a ridiculously stupid format, but what do people do?  instead of getting
  it right and reversing the New Jersey stupidity, they just translate it
  _halfway_ into other cultures.  sigh.  and since programs have to deal
  with the output of other programs, there are some things you _can't_ just
  translate without affecting everything.  the result is that people can't
  use these "localizations" except under carefully controlled conditions.

| - char-upcase etc. (e.g. Allegro is wrong when German special chars
|   are involved)

  well, I have written functions to deal with this, too.

(system-character-set)
=> #<naggum-software::character-set ASCII @ #x20251be2>
(string-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRüN IST MENSCHEN FLEISCH!"
;;;;          ^
(setf (system-character-set) ISO:8859-1)
=> #<naggum-software::character-set ISO 8859-1 @ #x20250f52>
(string-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRÜN IST MENSCHEN FLEISCH!"
;;;;          ^

  note, upcase rules that deal with ß->SS and ĳ->IJ are not implemented;
  this is still a simple character-to-character translation, so it leaves
  these two characters alone.
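
  such a character-to-character translation is little more than a range
  lookup.  a minimal sketch (not the code behind SYSTEM-CHARACTER-SET
  above, and assuming CHAR-CODE yields ASCII/ISO 8859-1 codes for these
  characters):

(defun latin-1-upcase (string)
  ;; plain character-to-character upcasing for ISO 8859-1 codes.
  ;; ß (#xDF) is left alone, just like in the example above.
  (map 'string
       (lambda (char)
         (let ((code (char-code char)))
           (cond ((char<= #\a char #\z)
                  (code-char (- code 32)))
                 ;; à (#xE0) .. þ (#xFE), excluding the division sign ÷ (#xF7)
                 ((and (<= #xE0 code #xFE) (/= code #xF7))
                  (code-char (- code 32)))
                 (t char))))
       string))

(latin-1-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRÜN IST MENSCHEN FLEISCH!"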

| - Daylight saving time & time zones

  Common Lisp is too weak in this respect, and so are most other solutions.
  it is wrong to let a time zone be just a number when parsing or decoding
  time specifications.  it is wrong to allow only one time zone to be fully
  supported.  I needed to fix this, so time zone data is fetched from the
  timezone database on demand, since the time zone names need to be loaded
  before they can be referenced.

  e.g., tz:berlin is initialized like this:

(define-timezone #"berlin"	    "Europe/Berlin")

  after which tz:berlin is bound to a timezone object:

(describe tz:berlin) ->
#<timezone berlin lazy-loaded from "Europe/Berlin" @ #x202b1fd2> is an instance
    of #<standard-class timezone>:
 The following slots have :instance allocation:
  name       timezone:berlin
  filename   "Europe/Berlin"
  zoneinfo   <unbound>
  reversed   <unbound>

  using it loads the data automatically:

(get-full-time tz:berlin)
=> #[1999-02-09 22:40:24.517+01:00]
tz:berlin
=> #<timezone berlin loaded from "Europe/Berlin" @ #x202b1fd2>

  you can ask for just the timezone of a particular time and zone, and you
  get the timezone and the universal-time of the previous and next changes,
  so it's possible to know how long a day in local time is without serious
  wastes.  (i.e., it is 23 or 25 hours at the change of timezone due to the
  infinitely stupid daylight savings time crap, but people won't switch to
  UTC, so we have to accommodate them fully.)

(time-zone [1999-07-04 12:00] tz:pacific)
=> 7 t "PDT" 3132208800 3150349200 

| Since every serious OS supports localization, Lisp implementations
| should be forced to use it.

  I protest vociferously.  let's get this incredible mess right.  if there
  is anything that causes more grief than the mind-bogglingly braindamaged
  attempts that, e.g., Microsoft makes at adapting to other cultures, I
  don't know what it is, and the Unix world is just tailing behind them,
  making the same idiotic mistakes.  IBM has done an incredible job in this
  area, but they _still_ listen to the wrong people, and don't realize that
  there are as many ways to write a date in each language as there are in
  the United States, so calling one particular format "Norwegian" is just
  plain wrong.  forcing one format on all Americans in the silly belief
  that they are all alike would perhaps cause sufficient rioting to
  get somebody's attention, because countries with smaller populations than
  some U.S. cities just won't be heard.

  e.g., if you want to use the supposedly "standard" Norwegian notation,
  that's 9.2.99, but people will want to write 9/2-99 or 9/2 1999, and if
  you do this, those who actually have to communicate with people elsewhere
  in the world will now be crippled unless they turn _off_ this cultural
  braindamage, and revert to whatever choice they get with the default.

  computers and programmers should speak English.  if you want to talk to
  people in your own culture, first consider international standards that
  get things right (like ISO 8601 for dates and times), then the smartest
  thing you can think of, onwards through to the stupidest thing you can
  think of, then perhaps what people have failed to understand is wrong.
  you don't have to adapt to anyone -- nobody adapts to you, and adapting
  should be a reciprocal thing, so do whatever is right and explain it to
  people.  90% of them will accept it.  the rest can go write their own
  software.  force accountants to see four-digit years, force Americans and
  the British to see 24-hour clocks, use dot as a decimal point, write
  dates and times with all numbers in strictly decreasing unit order, lie
  to managers when they ask if they can have the way they learned stuff in
  grade school in 1950 and say it's impossible in this day and age.
  computers should be instruments of progress.  if that isn't OK with some
  doofus, give him a keypunch, which is what computers looked like at the
  time when the other things they ask computers to do today were normal.  if
  people want you to adapt, put them to the test and see if they think
  adaptation is any good when it happens to themselves.  if it does, great
  -- they do what you say.  if not, you tell them "neither do I", and force
  them to accept your way, anyway.  it's that simple.

#:Erik
-- 
  Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
  Julk, August, September, October, November, December.
From: Christopher R. Barry
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <873e4fbawi.fsf@2xtreme.net>
Marco Antoniotti <·······@copernico.parades.rm.cnr.it> writes:

[16-bit characters galore deleted]

> On top of that, since it seems there will be another round of the ANSI
> CL committee, this could be meat for that venue.

Hmmm... MULL (MUlti Lingual Lisp)? This isn't going to make users of
8-bit character sets experience increased storage overhead for the
exact same string objects and a performance hit in string bashing
functions, now is it?

On the upside, unicode support could give an additional excuse for
Lisp's apparent "slowness" in certain situations. In my Java class the
instructor seems to always bring up unicode support as part of the
excuse for Java's lousy performance (hmm... this isn't really
comforting for some reason though...).

Christopher
From: Howard R. Stearns
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <36C20F62.3FDB5404@elwood.com>
Christopher R. Barry wrote:
> 
> Marco Antoniotti <·······@copernico.parades.rm.cnr.it> writes:
> 
> [16-bit characters galore deleted]
> 
> > On top of that, since it seems there will be another round of the ANSI
> > CL committee, this could be meat for that venue.
> 
> Hmmm... MULL (MUlti Lingual Lisp)? This isn't going to make users of
> 8-bit character sets experience increased storage overhead for the
> exact same string objects and a performance hit in string bashing
> functions, now is it?
> 
> On the upside, unicode support could give an additional excuse for
> Lisp's apparent "slowness" in certain situations. In my Java class the
> instructor seems to always bring up unicode support as part of the
> excuse for Java's lousy performance (hmm... this isn't really
> comforting for some reason though...).
> 
> Christopher

You've just run into what I believe is a misunderstanding, one that is
one of my pet peeves.  A while back in "IEEE Computer" magazine, some
yahoo decided that we don't need to use 16 bits to handle international
characters; instead, we usually only need 8 bits at a time, and we
would get better performance by using 8-bit characters for everything
along with a locally understood "current char set".  They eventually
printed a "letter to the editor" I sent, and the whole thing bugs me
enough that I'm going to repeat it here.

One issue you bring up that is not covered in the letter is whether
speed is affected in Lisp by simultaneously supporting BOTH ASCII and
Unicode.  I admit that runtime dispatching between the two different
string representations would cost time if the compiler can't figure out
at compile time which is being used.  In principle, proper declarations
fix this.  Of course, if you don't have any declarations at all, then
dispatching between all the different sequence types won't be any more
expensive if there are two kinds of strings possible.  However, it's true
that there is a volume of code (including system-supplied macros) which
declares things to simply be of type string (as opposed to base-string
or extended-string), and the benefits of these declarations might be
mostly lost if there are two kinds of strings.  One solution would be
for an implementation to simply ALWAYS use extended-strings, and it is
either this situation or fully declared code that is assumed in the
letter.
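
To make the declaration point concrete, here is the kind of code I have
in mind (just an illustrative sketch, not from any particular
implementation):

    (defun count-spaces (s)
      ;; With a SIMPLE-BASE-STRING declaration the compiler can open-code
      ;; 8-bit SCHAR accesses; with only a STRING declaration it may have
      ;; to dispatch between base-string and extended-string at runtime.
      (declare (type simple-base-string s))
      (count #\Space s))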

Anyway, here's my rant on trying to save space by using 8-bit characters
everywhere.  If I'm rattling this off too quickly, I'll be happy to
expand on any of the points.

----
I am confused by Neville Holmes's essay "Toward Decent Text Encoding"
(Computer, Aug. 1998, p. 108).  If I understand correctly, Holmes argues
that 16-bit Unicode encoding of characters wastes space.  Instead,
different regions of the world should use an implicitly understood 8-bit
local encoding.  Have we truly learned nothing from the Y2K problem?

My understanding is that

    It is not quite correct to refer to Unicode as a 16-bit standard.
Unicode actually uses a 32-bit space.  It is one of the more popular
subsets of Unicode, UCS-2, that happens to fit in 16 bits.

    The distinction between in-memory encoding and external
representation cannot be overemphasized.  Within a program, many
algorithms rely for their efficiency on being able to assume that
characters within a string are represented using a uniform width.  But
uniform width is much less important for external representations.
Therefore, multibyte and switching encodings can be used for data
transfer.  By definition, it is necessary to convert external
representations to uniform-width representations only when slow external
resources are involved.  Throughput need not be affected.

    Performance should not be greatly affected by the choice of 8-bit
or 16-bit uniform representation within memory.  On the other hand,
using nonuniform or shifting encodings in memory would have a much
greater effect on performance.

    Interface performance is also not an issue.  When sending character
data to a file or over the Internet, compression or alternate encodings
can be used.  For example, the UTF-8 encoding of Unicode is, bit-for-bit,
identical to ASCII encoding when used for text that happens to contain
only ASCII characters.

    Many international and de facto standards involving written
representations, especially for programming languages, include keywords
and punctuation from the European/Latin local encoding.  Within such
documents, then, this 8-bit system must coexist simultaneously with
other (presumably 8-bit) "local" encodings.  I do not believe, therefore,
that outside Europe and North America, only 8 bits would be consistently
sufficient within a single application or even within a single document.
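
To make the UTF-8/ASCII point above concrete, the encoding rules are
simple enough to sketch in a few lines (illustrative only; a real
implementation would work on streams or vectors rather than one code
point at a time):

    (defun utf-8-octets (code-point)
      ;; Encode one code point (up to #x10FFFF) as a list of UTF-8 octets.
      ;; For ASCII (below #x80), the single octet is the ASCII code itself.
      (cond ((< code-point #x80) (list code-point))
            ((< code-point #x800)
             (list (logior #xC0 (ldb (byte 5 6) code-point))
                   (logior #x80 (ldb (byte 6 0) code-point))))
            ((< code-point #x10000)
             (list (logior #xE0 (ldb (byte 4 12) code-point))
                   (logior #x80 (ldb (byte 6 6) code-point))
                   (logior #x80 (ldb (byte 6 0) code-point))))
            (t
             (list (logior #xF0 (ldb (byte 3 18) code-point))
                   (logior #x80 (ldb (byte 6 12) code-point))
                   (logior #x80 (ldb (byte 6 6) code-point))
                   (logior #x80 (ldb (byte 6 0) code-point))))))

    (utf-8-octets (char-code #\A))  ; => (65)
    (utf-8-octets #xE9)             ; => (195 169), the two octets of é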
From: Erik Naggum
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <3127703943282468@naggum.no>
* "Howard R. Stearns" <······@elwood.com>
| One that is one of my pet peaves.  A while back in "IEEE Computer"
| magazine, some yahoo decided that we don't need to use 16 bits to handle
| international characters.  Instead, we usually only need 8 bits at a
| time, and that we would get better performance by using 8-bit characters
| for everything along with a locally understood "current char set".  They
| eventually printed a "letter to the editor" I sent, and the whole thing
| bugs me enough that I'm going to repeat it here.

  the first ISO 10646 draft actually had this feature for character sets
  that only need 8 bits, complete with "High Octet Prefix", which was
  intended as a stateful encoding that would _never_ be useful in memory.
  this was a vastly superior coding scheme to UTF-8, which unfortunately
  penalizes everybody outside of the United States.  I actually think UTF-8
  is one of the least intelligent solutions to this problem around: it
  thwarts the whole effort of the Unicode Consortium and has already proven
  to be a reason why Unicode is not catching on.

  instead of this stupid encoding, only a few system libraries need to be
  updated to understand the UCS signature, #xFEFF, at the start of strings
  or streams.  it can even be byte-swapped without loss of information.  I
  don't think two bytes is a great loss, but the stateless morons in New
  Jersey couldn't be bothered to figure something like this out.  argh!

  when the UCS signature becomes widespread, any string or stream can be
  viewed initially as a byte sequence, and upon first access can easily be
  inspected for its true nature and the object can then change class into
  whatever the appropriate class should be.  it might even be byteswapped
  if appropriate.  this is not at all rocket science.  I think the UCS
  signature is among the smarter things in Unicode.  that #xFFFE is an
  invalid code and #xFEFF is a zero-width space are signs of a brilliant
  mind at work.  I don't know who invented this, but I _do_ know that UTF-8
  is a New Jersey-ism.
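
  to make that concrete: deciding what a byte sequence really is takes no
  more than a peek at the first two octets.  a sketch, not anybody's
  actual library code:

(defun guess-ucs-byte-order (octet-0 octet-1)
  ;; inspect the first two octets of a string or stream for the UCS
  ;; signature #xFEFF.  returns :BIG-ENDIAN, :LITTLE-ENDIAN, or NIL when
  ;; no signature is present and the data should be treated as bytes.
  (cond ((and (= octet-0 #xFE) (= octet-1 #xFF)) :big-endian)
        ((and (= octet-0 #xFF) (= octet-1 #xFE)) :little-endian)
        (t nil)))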

| One issue you bring up that is not covered in the letter is whether speed
| is effected in Lisp by simultaneously supporting BOTH ASCII and Unicode.

  there is actually a lot of evidence that UTF-8 slows things down because
  it has to be translated, but UTF-16 can be processed faster than ISO
  8859-1 on most modern computers because the memory access is simpler with
  16-bit units than with 8-bit units.  odd addresses are not free.

| It is not quite correct to refer to Unicode as a 16-bit standard.
| Unicode actually uses a 32-bit space. It is one of the more popular
| subsets of Unicode, UCS-2, that happens to fit in 16 bits.

  well, Unicode 1.0 was 16 bits, but Unicode is now 16 bits + 20 bits worth
  of extended space encoded as 32 bits using 1024 high and 1024 low codes
  from the set of 16-bit codes.  ISO 10646 is a 31-bit character set
  standard without any of this stupid hi-lo cruft.
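
  for reference, the hi-lo encoding is just a little arithmetic on the
  code point.  a sketch:

(defun surrogate-pair (code-point)
  ;; split a code point above #xFFFF into the 16-bit high and low codes
  ;; used by the 1024 x 1024 extension scheme.
  (let ((offset (- code-point #x10000)))
    (values (+ #xD800 (ldb (byte 10 10) offset))
            (+ #xDC00 (ldb (byte 10 0) offset)))))

(surrogate-pair #x10400)
=> 55297 56320                          ; i.e. #xD801 #xDC00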

  your point about the distinction between internal and external formats is
  generally lost on people who have never seen the concepts provided by the
  READ and WRITE functions in Common Lisp.  Lispers are used to dealing
  with different internal and external representations, and therefore have
  a strong propensity to understand much more complex issues than people
  who are likely to argue in favor of writing raw bytes from memory out to
  files as a form of "interchange", and who deal with all text as _strings_
  and repeatedly maul them with regexps.

  my experience is that there's no point in trying to argue with people who
  don't understand the concepts of internal and external representation --
  if you want to reach them at all, that's where you have to start, but be
  prepared for a paradigm shift happening in your audience's brain.  (it
  has been instructive to see how people suddenly grasp that a date is
  always read and written in ISO 8601 format although the machine actually
  deals with it as a large integer, the number of seconds since an epoch.
  Unix folks who are used to seeing the number _or_ a hacked-up crufty
  version of `ctime' output are truly amazed by this.)  if you can explain
  how and why conflating internal and external representation is bad karma,
  you can usually watch people get a serious case of revelation and
  their coding style changes there and then.  but just criticizing their
  choice of an internal-friendly external coding doesn't ring a bell.

#:Erik
-- 
  Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
  Julk, August, September, October, November, December.
From: Howard R. Stearns
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <36C2F3A1.7BADB65A@elwood.com>
Erik Naggum wrote:
> ...
>   your point about the distinction between internal and external formats is
>   generally lost on people who have never seen the concepts provided by the
>   READ and WRITE functions in Common Lisp.  Lispers are used to dealing
>   with different internal and external representations, and therefore have
>   a strong propensity to understand much more complex issues than people
>   who are likely to argue in favor of writing raw bytes from memory out to
>   files as a form of "interchange", and who deal with all text as _strings_
>   and repeatedly maul them with regexps.
> 
>   my experience is that there's no point in trying to argue with people who
>   don't understand the concepts of internal and external representation --
>   if you want to reach them at all, that's where you have to start, but be
>   prepared for a paradigm shift happening in your audience's brain.  (it
>   has been instructive to see how people suddenly grasp that a date is
>   always read and written in ISO 8601 format although the machine actually
>   deals with it as a large integer, the number of seconds since an epoch.
>   Unix folks who are used to seeing the number _or_ a hacked-up crufty
>   version of `ctime' output are truly amazed by this.)  if you can explain
>   how and why conflating internal and external representation is bad karma,
>   you can usually watch people get a serious case of revelation and
>   their coding style changes there and then.

EMPHASIS ON THE FOLLOWING!

>                                                but just criticizing their
>   choice of an internal-friendly external coding doesn't ring a bell.
> 
> #:Erik

I think that's good advice.  On the grounds that Lispers are used to
distinguishing between internal and external representations for objects
in general, programs, structures, lists, floating point numbers, dates,
etc., I'll repeat something here that people might use in evangelizing,
er, trying to explain things to others:

  I/O is much slower than computation!

For example, consider mobile code that you want to distribute around the
internet. With limited bandwidth and modern processors, it's often faster
to send a compressed encoding (e.g. gzcat) of the source code for some
expressive language (like Lisp), and then, on the receiving end, to
uncompress and compile it, than it is to send the "binaries".  (I think
there was a paper on this, with numbers, within the last 2 years
somewhere.  Comm. of the ACM?)

There are similar processing wins for dealing with file-systems and just
about anything with moving parts.  In some cases, it's even better to do
lots more computation within a program just to avoid making a lot of
system calls to the operating system.

The point is that if you want to make a program fast, do it right.  When
taken with Erik's other point about how 16 (or even 32) bit characters
may be faster than 8-bit anyway (due to word access in memory (Erik: I'm
not sure if this stays true with instruction prefetching, etc.)), one
can support Unicode characters within programs, and then do whatever is
needed on I/O.  

Don't worry about the extra space WITHIN MEMORY, and use processing
power to get rid of the extra space EXTERNALLY when needed.
From: Johan Kullstam
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <u3e4cc4bc.fsf@res.raytheon.com>
Erik Naggum <····@naggum.no> writes:

>   my experience is that there's no point in trying to argue with people who
>   don't understand the concepts of internal and external representation --
>   if you want to reach them at all, that's where you have to start, but be
>   prepared for a paradigm shift happening in your audience's brain.  (it
>   has been instructive to see how people suddenly grasp that a date is
>   always read and written in ISO 8601 format although the machine actually
>   deals with it as a large integer, the number of seconds since an epoch.
>   Unix folks who are used to seeing the number _or_ a hacked-up crufty
>   version of `ctime' output are truly amazed by this.)  if you can explain
>   how and why conflating internal and external representation is bad karma,
>   you can usually watch people get a serious case of revelation and
>   their coding style changes there and then.  but just criticizing their
>   choice of an internal-friendly external coding doesn't ring a
>   bell.

just to throw my two cents in

unix does do the internal/external dance in the filesystem.  files are
known to the user by a string.  files are identified to the system by
inode number.  directories are simply maps from string to number.
from the user's perspective, the inode numbers hardly exist at all.
(that the mapping ought to be performed by a hash table rather than an
association list is not relevant to the abstraction at work.)  this
pathname abstraction is one of the things that unix does do correctly.

the same principle should be applied to char-sets.  you shouldn't care
what bits it takes to represent the letter `A'.  it should be taken
out of the user's domain to prevent worrying about it and getting it
wrong.  the user generally assumes a constant width character
representation.  breaking this invites trouble.  worrying about
optimization of space should be delayed.

to support all the languages, i think 16 bit chars sounds like a good
thing.  i wouldn't mind having a cpu which could only address 16 bit
bytes (it'd double your address space in one fell swoop) and say
goodbye to 8 bit processing altogether.  a few (un)packing
instructions could be thrown in for space saving representations.

after all, how much is text these days anyway?  if you doubled the
size of the text files, how much more disk could it possibly use?  if
you are worried about it, a second packed 8 bit system could be used.
better yet, a huffman table (for your language or application) could be
applied to small sections, perhaps on (roughly) a line-by-line basis,
resulting in far higher data density than 8 bits per char.  this
packing should be, as much as possible, transparent to the user.

-- 
johan kullstam
From: Christopher R. Barry
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <87pv7gagwq.fsf@2xtreme.net>
Johan Kullstam <········@ne.mediaone.net> writes:


[...]

> to support all the languages, i think 16 bit chars sounds like a good
> thing.  i wouldn't mind having a cpu which could only address 16 bit
> bytes (it'd double your address space in one fell swoop) and say
> goodbye to 8 bit processing altogether.  a few (un)packing
> instructions could be thrown in for space saving representations.
> 
> after all, how much is text these days anyway?  if you doubled the
> size of the text files, how much more disk could it possibly use?

Well, since I roughly estimate that over 80% of the files on the
filesystems of my 9GB disk are ASCII encoded source code and
postscript, html and text documentation and configuration files, quite
a bit.

Christopher
From: Erik Naggum
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <3127701651609043@naggum.no>
* ······@2xtreme.net (Christopher R. Barry)
| Hmmm... MULL (MUlti Lingual Lisp)?

  give me a break.  Common Lisp has all it needs to move to a smart wide
  character set such as Unicode.  we even support external character set
  codings in the :EXTERNAL-FORMAT argument to stream functions.  it's all
  there.  all the stuff that is needed to handle input and output should
  also be properly handled by the environment -- if not, there's no use for
  such a feature since you can neither enter nor display nor print Unicode
  text.
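
  for instance, something along these lines (only :DEFAULT is a portable
  external-format designator, so take the keyword and the file name as
  placeholders):

(with-open-file (in "message.txt"
                    :direction :input
                    :external-format :default)
  (read-line in))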

| This isn't going to make users of 8-bit character sets experience
| increased storage overhead for the exact same string objects and a
| performance hit in string bashing functions, now is it?

  there are performance reasons to use 16 bits per character over 8 bits in
  modern hardware already, but if you need only 8 bits, use BASE-STRING
  instead of STRING.  it's only a vector, anyway, and Common Lisp can
  already handle specialized vectors of various size elements.

  if it is important to separate between STRING and BASE-STRING, I'm sure a
  smart implementation would do the same for strings as the standard does
  for floats: *READ-DEFAULT-FLOAT-FORMAT*.

| On the upside, unicode support could give an additional excuse for Lisp's
| apparent "slowness" in certain situations.  In my Java class the
| instructor seems to always bring up unicode support as part of the excuse
| for Java's lousy performance (hmm... this isn't really comforting for
| some reason though...).

  criminy.  can teachers be sued for malpractice?  if so, go for it.

#:Erik
-- 
  Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
  Julk, August, September, October, November, December.
From: Rainer Joswig
Subject: Re: 8-bit input (or, "Perl attacks on non-English language communities!")
Date: 
Message-ID: <joswig-0902991446230001@pbg3.lavielle.com>
In article <············@news.u-bordeaux.fr>, ······@clisp.cons.org (Bruno Haible) wrote:

> [1]> (defun écrivez (m)
>        "Écrivez un message."
>        (write-line m))
> ÉCRIVEZ
> [2]> (écrivez "Bonjour, tout le monde!")
> Bonjour, tout le monde!
> "Bonjour, tout le monde!"
> [3]> 
> À bientôt!
> 
> The next release of CLISP will not only be 8-bit clean, it will also support
> Unicode 16-bit characters.
> 
>               Bruno                              http://clisp.cons.org/

With Macintosh Common Lisp you can program in Kanji, if you like. ;-)

-- 
http://www.lavielle.com/~joswig