From: Thomas Bushnell, BSG
Subject: Wide character implementation
Date: 
Message-ID: <87wuw92lhc.fsf@becket.becket.net>
If one uses tagged pointers, then it's easy to implement fixnums as
ASCII characters efficiently.

But suppose one wants to have the character datatype be 32-bit Unicode
characters?  Or worse yet, 35-bit Unicode characters?

At the same time, most characters in the system will of course not be
wide.  What are the sane implementation strategies for this?

From: Frode Vatvedt Fjeld
Subject: Re: Wide character implementation
Date: 
Message-ID: <2hit7sylec.fsf@vserver.cs.uit.no>
·········@becket.net (Thomas Bushnell, BSG) writes:

> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.

Hm.. perhaps you mean it's easy to implement characters as immediate
values?

> But suppose one wants to have the character datatype be 32-bit
> Unicode characters?  Or worse yet, 35-bit Unicode characters?
>
> At the same time, most characters in the system will of course not
> be wide.  What are the sane implementation strategies for this?

I suppose to assign "most characters in the system" to a sub-type of
the wide characters, and implement that sub-type as immediates.

-- 
Frode Vatvedt Fjeld
From: Pierpaolo BERNARDI
Subject: Re: Wide character implementation
Date: 
Message-ID: <hjEl8.3381$pT1.74251@news1.tin.it>
"Thomas Bushnell, BSG" <·········@becket.net> ha scritto nel messaggio
···················@becket.becket.net...
>
> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.
>
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?

21 bits are enough for Unicode.

P.
From: Janis Dzerins
Subject: Re: Wide character implementation
Date: 
Message-ID: <87d6y0ztcn.fsf@asaka.latnet.lv>
"Pierpaolo BERNARDI" <··················@hotmail.com> writes:

> "Thomas Bushnell, BSG" <·········@becket.net> ha scritto nel messaggio
> ···················@becket.becket.net...
> >
> > If one uses tagged pointers, then its easy to implement fixnums as
> > ASCII characters efficiently.
> >
> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters?  Or worse yet, 35-bit Unicode characters?
> 
> 21 bits are enough for Unicode.

What "Unicode"?

-- 
Janis Dzerins

  Eat shit -- billions of flies can't be wrong.
From: Pierpaolo BERNARDI
Subject: Re: Wide character implementation
Date: 
Message-ID: <_fIl8.5053$S52.129847@news2.tin.it>
"Janis Dzerins" <·····@latnet.lv> ha scritto nel messaggio
···················@asaka.latnet.lv...
> "Pierpaolo BERNARDI" <··················@hotmail.com> writes:
>
> > "Thomas Bushnell, BSG" <·········@becket.net> ha scritto nel messaggio
> > ···················@becket.becket.net...
> > >
> > > If one uses tagged pointers, then its easy to implement fixnums as
> > > ASCII characters efficiently.
> > >
> > > But suppose one wants to have the character datatype be 32-bit Unicode
> > > characters?  Or worse yet, 35-bit Unicode characters?
> >
> > 21 bits are enough for Unicode.
>
> What "Unicode"?

The character encoding standard defined by the Unicode Consortium, Inc.
Are there other Unicodes?

P.
From: lin8080
Subject: Re: Wide character implementation
Date: 
Message-ID: <3C97873E.2C1F3852@freenet.de>
Janis Dzerins schrieb:

> "Pierpaolo BERNARDI" <··················@hotmail.com> writes:

> > "Thomas Bushnell, BSG" <·········@becket.net> ha scritto nel messaggio
> > ···················@becket.becket.net...

> > 21 bits are enough for Unicode.
> 
> What "Unicode"?

Try:

http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html

http://www.cl.cam.ac.uk/~mgk25/unicode.html


stefan
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225568524784163@naggum.net>
* Janis Dzerins <·····@latnet.lv>
| What "Unicode"?

  unicode.org

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: Wide character implementation
Date: 
Message-ID: <87r8mgchmp.fsf@becket.becket.net>
"Pierpaolo BERNARDI" <··················@hotmail.com> writes:

> 21 bits are enough for Unicode.

Um, Unicode version 3.1.1 has the following as the largest character:

E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;

Now the Unicode space isn't sparse, but I don't think compressing the
space is the most efficient strategy.
From: Tim Moore
Subject: Re: Wide character implementation
Date: 
Message-ID: <a78hq3$rcd$0@216.39.145.192>
On 19 Mar 2002 14:33:34 -0800, Thomas Bushnell, BSG <·········@becket.net>
 wrote:
>"Pierpaolo BERNARDI" <··················@hotmail.com> writes:
>
>> 21 bits are enough for Unicode.
>
>Um, Unicode version 3.1.1 has the following as the largest character:
>
>E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;
>
>Now the Unicode space isn't sparse, but I don't think compressing the
>space is the most efficient strategy.
>

Um, what's your point? E007f fits in 20 bits.  If you're thinking
that's all that's needed, there are private use areas (E000..F8FF,
F0000..FFFFD, and 100000..10FFFD) that need to be encoded too.  So 21
bits looks right.
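
A quick sanity check at a Lisp prompt (illustrative only):

   (integer-length #x10FFFD) => 21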

Tim
From: Thomas Bushnell, BSG
Subject: Re: Wide character implementation
Date: 
Message-ID: <87ofhkazo4.fsf@becket.becket.net>
······@sea-tmoore-l.dotcast.com (Tim Moore) writes:

> Um, what's your point? E007f fits in 20 bits.  If you're thinking
> that's all that's needed, there are private use areas (E000..F8FF,
> F0000..FFFFD, and 100000..10FFFD) that need to be encoded too.  So 21
> bits looks right.

Oh what an embarrassing brain fart, yes that's quite right.  I don't
know what I was counting, but my head was clearly on backwards.
From: Ben Goetter
Subject: Re: Wide character implementation
Date: 
Message-ID: <a77q1h$jgj$0@216.39.136.5>
Quoth Pierpaolo BERNARDI:
> "Thomas Bushnell, BSG" <·········@becket.net> ha scritto 
> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters?  Or worse yet, 35-bit Unicode characters?
> 
> 21 bits are enough for Unicode.

And ISO 10646, per working group resolution.

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2175.htm
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2225.doc
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225524036151618@naggum.net>
* Thomas Bushnell, BSG
| If one uses tagged pointers, then its easy to implement fixnums as
| ASCII characters efficiently.

  Huh?  No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters?  Or worse yet, 35-bit Unicode characters?

  Unicode is a 31-bit character set.  The base multilingual plane is 16
  bits wide, and then there is the possibility of 20 bits encoded in two
  16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
  (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
  but one does not have to understand the lo- and hi-word codes that make
  up the 20-bit character space.  In effect, you need 16 bits.  Therefore,
  you could represent characters with the following bit pattern, with b for
  bits and c for code.  Fonts are a mistake, so they are removed.

000000ccccccccccccccccccccc00110

  This is useful when the fixnum type tag is either 000 for even fixnums
  and 100 for odd fixnums, effectively 00 for fixnums.  This makes
  char-code and code-char a single shift operation.  Of course, char-bits
  and char-font are not supported in this scheme, but if you _really_ have
  to, the upper 4 bits may be used for char-bits.
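
  A minimal sketch of that arithmetic in Common Lisp, simulating the tagged
  words as plain integers (the 00110 character tag and the 00 fixnum tag
  come from the layout above; the helper names are made up for
  illustration):

    (defconstant +char-tag+ #b00110)

    (defun make-char-word (code)
      ;; 21-bit code in bits 5-25, character tag in the low five bits
      (logior (ash code 5) +char-tag+))

    (defun char-word-code (word)
      ;; a single right shift by 3 turns a character word directly into
      ;; the fixnum representation of its code (value << 2, tag bits 00)
      (ash word -3))

    (defun fixnum-word-value (word)
      (ash word -2))

    (fixnum-word-value (char-word-code (make-char-word #x41))) => 65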

| At the same time, most characters in the system will of course not be
| wide.  What are the sane implementation strategies for this?

  I would (again) recommend actually reading the specification.  The
  character type can handle everything, but base-char could handle the
  8-bit things that reasonable people use.  The normal string type has
  character elements while base-string has base-char elements.  It would
  seem fairly reasonable to implement a *read-default-string-type* that
  would take string or base-string as value if you choose to implement both
  string types.
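
  For illustration, in an implementation that provides both string types,
  the element type given at creation time is what selects between them
  (a sketch):

    (make-array 10 :element-type 'base-char)   ; a base-string
    (make-array 10 :element-type 'character)   ; a (general) string
    (subtypep 'base-string 'string)            => t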

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Sander Vesik
Subject: Re: Wide character implementation
Date: 
Message-ID: <1016555222.461162@haldjas.folklore.ee>
In comp.lang.scheme Erik Naggum <····@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | If one uses tagged pointers, then its easy to implement fixnums as
> | ASCII characters efficiently.
> 
>  Huh?  No sense this makes.
> 
> | But suppose one wants to have the character datatype be 32-bit Unicode
> | characters?  Or worse yet, 35-bit Unicode characters?
> 
>  Unicode is a 31-bit character set.  The base multilingual plane is 16
>  bits wide, and then there are the possibility of 20 bits encoded in two
>  16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
>  (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
>  but one does not have to understand the lo- and hi-word codes that make
>  up the 20-bit character space.  In effect, you need 16 bits.  Therefore,
>  you could represent characters with the following bit pattern, with b for
>  bits and c for code.  Fonts are a mistake, so is removed.
> 
> 000000ccccccccccccccccccccc00110

I don't think this is true any more; as of Unicode 3.1, AFAIK, 16 bits is
no longer enough.

[snip - this doesn't sound like scheme]

-- 
	Sander

+++ Out of cheese error +++
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225568713707928@naggum.net>
* Sander Vesik <······@haldjas.folklore.ee>
| I don't  think this is true any more as of unicode 3.1 afaik, 16 bits is
| no longer enough.

  Please pay attention and actually make an effort to read what you respond
  to, will you?  You should also be able to count the number of c bits and
  arrive at a number greater than 16 if you do not get lost on the way.

  Sheesh, some people.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Pekka P. Pirinen
Subject: Re: Wide character implementation
Date: 
Message-ID: <uofhj9pov.fsf@globalgraphics.com>
[comp.lang.lisp only]

Erik Naggum <····@naggum.net> writes:
> * Thomas Bushnell, BSG
> | At the same time, most characters in the system will of course not be
> | wide.  What are the sane implementation strategies for this?
> 
>   [...] The normal string type has character elements while
>   base-string has base-char elements.  It would seem fairly
>   reasonable to implement a *read-default-string-type* that would
>   take string or base-string as value if you choose to implement
>   both string types.

Yes, that's basically it.  

In actual fact, Liquid and Lispworks have
*DEFAULT-CHARACTER-ELEMENT-TYPE* for various functions taking an
:ELEMENT-TYPE argument, and other similar needs.  See
<http://www.xanalys.com/software_tools/reference/lwl42/LWRM-U/html/lwref-u-198.htm#pgfId-1008739>.
Although the doc doesn't say it (there's a lot of unpublished doc on
fat characters), LW:*DEFAULT-CHARACTER-ELEMENT-TYPE* also controls
what kind of strings the reader constructs from the "" syntax.
However, if characters of larger types are seen by the string reader,
a string that can hold these characters is constructed without
complaint.

(This also avoids any confusion from STRING being a supertype of
BASE-STRING.)

Note that it is the programmer's responsibility to choose and declare
suitable character and string types, if they want to write a program
that works efficiently with both BASE-CHAR and larger character sets.
The implementation cannot possibly know enough to make the right
choices.  It can only offer a selection of types and interfaces to
control the types for each language feature.
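
For example, a routine that must stay fast on 8-bit text would carry an
explicit declaration (a sketch; the function and its name are made up):

   (defun count-spaces (s)
     (declare (type simple-base-string s))
     (count #\Space s))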
-- 
Pekka P. Pirinen, Global Graphics Software Limited
In cyberspace, everybody can hear you scream.  - Gary Lewandowski
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225694477809279@naggum.net>
* Pekka P. Pirinen
| Note that it is the programmer's responsibility to choose and declare
| suitable character and string types, if they want to write a program
| that works efficiently with both BASE-CHAR and larger character sets.

  If they want that, they should always use the types string and character.
  Only if the programmer knows that he creates base-string and base-char
  objects should he so declare them.  Since string is carefully worded to
  be a collection of types, an implementation that declares strings
  exclusively will work for all subtypes of string.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Sander Vesik
Subject: Re: Wide character implementation
Date: 
Message-ID: <1016554947.964486@haldjas.folklore.ee>
In comp.lang.scheme Thomas Bushnell, BSG <·········@becket.net> wrote:
> 
> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.
> 
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?

They use either UTF-8 or UTF-16 - you cannot rely on whatever size
you pick to be suitably long forever; Unicode is sort of inherently
variable-length (characters even have two possible representations
in many cases, ä and similar 8-)

> 
> At the same time, most characters in the system will of course not be
> wide.  What are the sane implementation strategies for this?
> 

Implement them as variable-length strings using, say, UTF-8. Also, saying that
most characters will not be wide may well be a wrong assumption 8-)

-- 
	Sander

+++ Out of cheese error +++
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225568971513146@naggum.net>
* Sander Vesik <······@haldjas.folklore.ee>
| They use either UTF8 or UTF16 - you cannot rely on whetvere size
| you pick to be suitably long forever, unicode is sort of inherently
| variable-length (characters even have too possible representations 
| in many cases, &auml; and similar 8-)

  Variable-length characters?  What the hell are you talking about?  UTF-8
  is a variable-length _encoding_ of characters that most certainly are
  intended to require a fixed number of bits.  That is, unless you think
  the digit 3 take up only 6 bits while the letter A takes up 7 bits and
  the symbol ä takes up 8.  Then you have variable-length characters.  Few
  people consider this a meaningful way of talking about variable length.

| Implement them as variable-length strings using say UTF-8. Also, saying
| that most characters will not be wide may well be a wrong assumptin 8-)

  Real programming languages work with real character objects, not just
  UTF-8-encoded strings in memory.

  Acquire clue, _then_ post, OK?

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: David Rush
Subject: Re: Wide character implementation
Date: 
Message-ID: <okfofhjiq9f.fsf@bellsouth.net>
Erik Naggum <····@naggum.net> writes:
> * Sander Vesik <······@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whetvere size
> | you pick to be suitably long forever, unicode is sort of inherently
> | variable-length (characters even have too possible representations 
> | in many cases, &auml; and similar 8-)
> 
>   Variable-length characters?  What the hell are you talking about?  UTF-8
>   is a variable-length _encoding_ of characters that most certainly are
>   intended to require a fixed number of bits.  That is, unless you think
>   the digit 3 take up only 6 bits while the letter A takes up 7 bits and
>   the symbol � takes up 8.  Then you have variable-length characters.  Few
>   people consider this a meaningful way of talking about variable length.

Erik, this is beneath you. Surely you know that Octet != Character.

>   Acquire clue, _then_ post, OK?

In context, rather pathetic, this seems...

david rush
-- 
The important thing is victory, not persistence.
	-- the Silicon Valley Tarot
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225694560401617@naggum.net>
* David Rush <····@bellsouth.net>
| Erik, this is beneath you. Surely you know that Octet != Character.

  If you think this is about octets, you are retarded and proud of it.

| >   Acquire clue, _then_ post, OK?
| 
| In context, rather pathetic, this seems...

  Learn of what you speak, _then_ become a snotty asshole, OK?

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Sander Vesik
Subject: Re: Wide character implementation
Date: 
Message-ID: <1016831590.163240@haldjas.folklore.ee>
In comp.lang.scheme Erik Naggum <····@naggum.net> wrote:
> * Sander Vesik <······@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whetvere size
> | you pick to be suitably long forever, unicode is sort of inherently
> | variable-length (characters even have too possible representations 
> | in many cases, &auml; and similar 8-)
> 
>  Variable-length characters?  What the hell are you talking about?  UTF-8
>  is a variable-length _encoding_ of characters that most certainly are
>  intended to require a fixed number of bits.  That is, unless you think
>  the digit 3 take up only 6 bits while the letter A takes up 7 bits and
>  the symbol ä takes up 8.  Then you have variable-length characters.  Few
>  people consider this a meaningful way of talking about variable length.

Wake up, smell the coffee and learn about 'combiners'. And then *think*
just a little bit, including about things like collation, sort order
and similar.

> 
> ///

-- 
	Sander

+++ Out of cheese error +++
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225841444459787@naggum.net>
* Sander Vesik
| Wake up, smnell the coffee and learn about 'combiners'.  And then *think*
| just a little bit, including about thinks like collation, sort order and
| similar.

  Perhaps you are unaware of the character concept as used in Unicode?  It
  would seem prudent at this time for you to return to the sources and
  obtain the information you lack.  To wit, what you incompetently refer to
  as "combiners" are actually called "combining characters".  I suspect you
  knew that, too, since nobody _else_ calls them "combiners".  But it seems
  that you are fighting for your honor, now, not technical correctness, and
  I shall leave to you another pathetic attempt to feel good about yourself
  when you should acknowledge inferior knowledge and learn something.

  Oh, by the way, Unicode has three levels.  Study Unicode, and you will
  know what they mean and what they do.  Hint: "variable-length character"
  is an incompetent restatement.  A single _glyph_ may be made up of more
  than one _character_ and a given glyph may be specified using more than
  one character.  If you had known Unicode at all, you would know this.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Sander Vesik
Subject: Re: Wide character implementation
Date: 
Message-ID: <1016909497.106880@haldjas.folklore.ee>
In comp.lang.scheme Erik Naggum <····@naggum.net> wrote:
> * Sander Vesik
> | Wake up, smnell the coffee and learn about 'combiners'.  And then *think*
> | just a little bit, including about thinks like collation, sort order and
> | similar.
> 
>  Perhaps you are unaware of the character concept as used in Unicode?  It
>  would seem prudent at this time for you to return to the sources and
>  obtain the information you lack.  To wit, what you incompetently refer to
>  as "combiners" are actually called "combining characters".  I suspect you
>  knew that, too, since nobody _else_ calls them "combiners".  But it seems
>  that you are fighting for your honor, now, not technical correctness, and
>  I shall leave to you another pathetic attempt to feel good about yourself
>  when you should acknowledge inferior knowledge and learn something.

I don't subscribe to the concept of honour. I also couldn't care less what
you think of me. 

> 
>  Oh, by the way, Unicode has three levels.  Study Unicode, and you will
>  know that they mean and what they do.  Hint: "variable-length character"
>  is an incompetent restatement.  A single _glyph_ may be made up of more
>  than one _character_ and a given glyph may be specifed using more than
>  one character.  If you had known Unicode at all, you would know this.

It is pointless to think of glyphs in any other way than characters - it should
not make any difference whether a-diaeresis is represented by one code point
- the precomposed one - or two. In fact, if there is a detectable difference
from anything dealing with text strings, the implementation is demonstrably
broken.


> 
> ///

-- 
	Sander

+++ Out of cheese error +++
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225923202075012@naggum.net>
* Sander Vesik
| I also couldn't care less what you think of me.

  You should realize that only people who care a lot, make this point.

| It is pointless to think of glyph in any other way than characters - it
| should not make any difference whetever adiaresis is represented by one
| code point - the precombined one - or two.  In fact, if there is a
| detctable difference from anything dealing with text strings the
| implementation is demonstratably broken.

  It took the character set community many years to figure out the crucial
  conceptual and then practical difference between the "characteristic
  glyph" of a character and the character itself, namly that a character
  may have more than one glyph, and a glyph may represent more than one
  character.  If you work with characters as if they were glyphs, you
  _will_ lose, and you make just the kind of arguments that were made by
  people who did _not_ grasp this difference in the ISO committees back in
  1992 and who directly or indirectly caused Unicode to win over the
  original ISO 10646 design.  Unicode has many concessions to those who
  think character sets are also glyph sets, such as the presentation forms,
  but that only means that there are different times you would use
  different parts of the Unicode code space.  Some people who try to use
  Unicode completely miss this point.

  It also took some _companies_ a really long time to figure out the difference
  between glyph sets and character sets.  (E.g., Apple and Xerox, and, of
  course, Microsoft has yet to reinvent the distinction badly in the name
  of "innovation", so their ISO 8859-1-like joke violates important rules
  for character sets.)  I see that you are still in the pre-enlightenment
  state of mind and have failed to grasp what Unicode does with its three
  levels.  I cannot help you, since you appear to stop thinking in order to
  protect or defend yourself or whatever (it sure looks like some mideast
  "honor" codex to me), but if you just pick up the standard and read its
  excellent introductions or even Unicode: A Primer, by Tony Graham, you
  will understand a lot more.  It does an excellent job of explaining the
  distinction between glyph and character.  I think you need it much more
  than trying to defend yourself by insulting me with your ignorance.

  Now, if you want to use or not use combining characters, you make an
  effort to convert your input to your preferred form before you start
  processing.  This isolates the "problem" to a well-defined interface, and
  it is no longer a problem in properly designed systems.  If you plan to
  compare a string with combining characters with one without them, you are
  already so confused that there is no point in trying to tell you how
  useless this is.  This means that thinking in terms of "variable-length
  characters" is prima facie evidence of a serious lack of insight _and_ an
  attitude problem that something somebody else has done is wrong and that
  you know better than everybody else.  Neither are problems with Unicode.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: Wide character implementation
Date: 
Message-ID: <87it7m4mnm.fsf@becket.becket.net>
So a secondary question; if one is designing a new Common Lisp or
Scheme system, and one is not encumbered by any requirements about
being consistent with existing code, existing operating systems, or
existing communications protocols and interchange formats: that is, if
one gets to design the world over again:

Should the Scheme/CL type "character" hold Unicode characters, or
Unicode glyphs?  (It seems clear to me that it should hold characters,
but I might be thinking about it poorly.)

And, whichever answer, why is that the right answer?

Thomas
From: cr88192
Subject: Re: Wide character implementation
Date: 
Message-ID: <u9qn9p1p3njfb3@corp.supernews.com>
> 
> Should the Scheme/CL type "character" hold Unicode characters, or
> Unicode glyphs?  (It seems clear to me that it should hold characters,
> but I might be thinking about it poorly.)
> 
> And, whichever answer, why is that the right answer?
> 
one could use "the cheap man's unicode" or utf-8.
actually personally I don't care so much about unicode and have held it in 
the "possibly later" respect. for now it is not terribly important as I can 
just restrict myself to the lower 128 characters.
in any case it sounds simpler to implement than the "codepage" system, so I 
will probably use it.

"ich bin einen Amerikaner, und ich tun nicht erweiterter Zeichen noetig" 
(don't mind bad grammar, as I don't really know german...).

nevermind...
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225941523389213@naggum.net>
* ·········@becket.net (Thomas Bushnell, BSG)
| Should the Scheme/CL type "character" hold Unicode characters, or
| Unicode glyphs?  (It seems clear to me that it should hold characters,
| but I might be thinking about it poorly.)

  There are no Unicode glyphs.  This properly refers to the equivalence of
  a sequence of characters starting with a base character and optionally
  followed by combining characters, and "precomposed" characters.  This is the
  canonical-equivalence of character sequences.  A processor of Unicode
  text is allowed to replace any character sequence with any of its
  canonically-equivalent character sequences.  It is in this regard that an
  application may want to request a particular composite character either
  as one character or a character sequence, and may decide to examine each
  coded character element individually or as an interpreted character.
  These constitute three different levels of interpretation that it must be
  possible to specify.  Since an application is explicitly permitted to
  choose any of the canonical-equivalent character sequences for a
  character, the only reasonable approach is to normalize characters into a
  known internal form.

  There is one crucial restriction on the ability to use equivalent
  character sequences.  ISO 10646 defines implementation levels 1, 2 and 3
  that, respectively, prohibit all combining characters, allow most
  combining characters, and allow all combining characters.  This is a very
  important part of the whole Unicode effort, but Unicode has elected to
  refer to ISO 10646 for this, instead of adopting it.  From my personal
  communication with high-ranking officials in the Unicode consortium, this
  is a political decision, not a technical one, because it was feared that
  implementors that would be happy with trivial character-to-glyph-mapping
  software (such as a conflation of character and glyph concepts and fonts
  that support this conflation), especially in the Latin script cultures,
  would simply drop support for the more complex usage of the Latin script
  and would fail to implement e.g., Greek properly.  Far from being an
  enabling technology, it was feared that implementing the full set of
  equivalences would be omitted and thus not enable the international
  support that was so sought after.  ISO 10646, on the other hand, has
  realized that implementors will need time to get all this right, and may
  choose to defer implementation of Unicode entirely if they are not able
  to do it stepwise.  ISO 10646 Level 1 is intended to be workable for a
  large number of uses, while Level 3 is felt not to have an advantage qua
  requirement until languages that require far more than composition and
  decomposition are to be fully supported.  I concur strongly with this.

  The character-to-glyph mapping is fraught with problems.  One possible
  way to do this is actually to use the large private use areas to build
  glyphs and then internally use only non-combining characters.  The level
  of dynamism in the character coding and character-to-glyph mapping here
  is so much difficult to get right that the canonical-equivalent sequences
  of characters (which is a fairly simple table-lookup process) pales in
  comparison.  That is, _if_ you allow combining characters, actually being
  able to display them and reason about them (such as computing widths or
  dealing with character properties of the implicit base character or
  converting their case) is far more difficult than decomposing and
  composing characters.

  As for the scary effect of "variable length" -- if you do not like it,
  canonicalize the input stream.  This really is an isolatable non-problem.
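
  For example (a sketch; it assumes code-char accepts these code points):
  the precomposed a-diaeresis U+00E4 and the sequence U+0061 U+0308 are
  canonically equivalent, yet compare as different strings until one
  normal form has been chosen:

    (string= (string (code-char #xE4))
             (coerce (list #\a (code-char #x308)) 'string))
    => nil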
  
///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3225942059872001@naggum.net>
* Thomas Bushnell, BSG
| So a secondary question; if one is designing a new Common Lisp or Scheme
| system, and one is not encumbered by any requirements about being
| consistent with existing code, existing operating systems, or existing
| communications protocols and interchange formats: that is, if one gets to
| design the world over again:

  If we could design the world over again, the _first_ thing I would want to
  do is making "capital letter" a combining modifier instead of doubling
  the size of the code space required to handle it.  Not only would this be
  such a strong signal to people not to use case-sensitive identifiers in
  programming languages, we would have a far better time as programmers.
  E.g., considering the enormous amount of information Braille can squeeze
  into only 6 bits, with codes for many common words and codes to switch to
  and from digits and to capital letters, the limitations of their code
  space have effectively been very beneficial.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Ed L Cashin
Subject: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <871ye9i91x.fsf_-_@cs.uga.edu>
Erik Naggum <····@naggum.net> writes:

...
>   If we could design the world over again, the _first_ ting I would
>   want to do is making "capital letter" a combining modifier instead
>   of doubling the size of the code space required to handle it.  Not
>   only would this be such a strong signal to people not to use
>   case-sensitive identifiers in programming languages, we would have
>   a far better time as programmers.

Could you elaborate on that a bit?  I'm interested because it appears
that your position is that case-sensitivity in identifiers is a Bad
Thing for programming languages.

A general principle of mine is that if things are distinguishable,
they should not be collapsed but the distinction should be preserved
whenever possible.  Treating different characters as the same
character, or treating different character sequences as equivalent,
should be postponed as long as possible in order to preserve
information.

Are you suggesting that this principle is inappropriate to apply to
the character sequences that compose identifiers in source code?  That
would mean that "ABLE" is the same identifier as "able".  I must admit
that when I first found out that current lisps have case-insensitive
symbol names, I thought it reminiscent of BASIC -- kind of a throwback
to a time when memory was much more at a premium.  (I know that Lisp
predates BASIC.  I'm talking about my reaction.)  I'd be happy to hear
a good case for case-insensitive identifiers.

-- 
--Ed L Cashin            |   PGP public key:
  ·······@uga.edu        |   http://noserose.net/e/pgp/
From: Kent M Pitman
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <sfwg02ptfvt.fsf@shell01.TheWorld.com>
Ed L Cashin <·······@uga.edu> writes:

> Erik Naggum <····@naggum.net> writes:
> 
> ...
> >   If we could design the world over again, the _first_ ting I would
> >   want to do is making "capital letter" a combining modifier instead
> >   of doubling the size of the code space required to handle it.  Not
> >   only would this be such a strong signal to people not to use
> >   case-sensitive identifiers in programming languages, we would have
> >   a far better time as programmers.
> 
> Could you elaborate on that a bit?  I'm interested because it appears
> that you're position is that case-sensitivity in identifiers is a Bad
> Thing for programming languages.
> 
> A general principle of mine is that if things are distinguishable,
> they should not be collapsed but the distinction should be preserved
> whenever possible.  Treating different characters as the same
> character, or treating different character sequences as equivalent,
> should be postponed as long as possible in order to preserve
> information.

Psychology experiments have empirically shown that memory is auditory.
That is, when you misremember words, you misremember them by soundalike,
not by lookalike.  There is also ample linguistic evidence that the core of
human language is an auditory phenomenon.  When languages vary, they first
change in their spoken form and then later writing catches up, not much
vice versa.  Since the spoken form has no notation for case differentiation,
the pretty obvious conclusion is that conceptual information is not best
carried in case.  People don't remember whether they saw a word written in
uppercase or lowercase, they just remember the word.  It is very rare and
quite awkward for someone to say "Use Capitalized-Foo" or 
"Use All-Uppercase-FOO" to someone out loud in areas other than computer
science where people have worked themselves into corners by being pedantic
on a "general principle" as in your previous paragraph rather than observing
well-researched truths about how people really think.  

Some of us believe that a proper harmonization/synchronization with the
way peoples' brains work is more important than catering to a theoretical
model that some people think would be a nice way for people to think.

I personally have made it a design goal in languages that I've worked on
to think hard about making even programming languages gracefully pronounceable
so that people can talk about programs aloud to each other over dinner, etc.
Modern Lisp has mostly moved away from obscure little names like "rplacd"
and such (a small number being retained mostly for history).  For new 
concepts, make names like MOST-POSITIVE-FIXNUM not MAXINT.  

Even in cased languages, mostly people don't use case to distinguish, they
just use it for controlling the look of code.  It's not uncommon for people
to have some things named Foo and others named BAR, but it's rarer for things
to be both named foo and Foo in a context where simple namespacing can't
tell the difference.  So often again you don't hear people saying the case
out loud because it can be determined from other factors.  At that point,
you might as well let people write stuff in whatever case they want, for
ease of input, and just let code pretty-printers adjust the case to a pretty
look if it's really needed.

IMO, no ordinary code should ever be case-sensitive and it's a darned shame
that XML uses case-sensitive identifiers.  I think it does mainly so it
can service languages that have made a bad design decision ... so it's a 
dependent bad decision, not an independent one.

> Are you suggesting that this principle is inappropriate to apply to
> the character sequences that compose identifiers in source code?  That
> would mean that "ABLE" is the same identifier as "able".

Yes.

> I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium.  (I know that Lisp
> predates BASIC.  I'm talking about my reaction.)  I'd be happy to hear
> a good case for case-insensitive identifiers.

Cased names are often a substitute in infix languages for having given up
the hyphen in a way that got messy.  You can't call a variable MOST-POSITIVE-FIXNUM
in most languages, because it thinks you mean MOST - POSITIVE - FIXNUM, a
subtraction.  Dylan requires you to put spaces around minus so it can 
have both minus and subtraction.  Doing MostPositiveFixnum is not very
natural and also forces case to be used in a way that supports separation,
taking away the ability to use case for what it was intended for: supporting
the underlying language.  So if I have a word like eBusiness in "English"
and I want to compose it into a function, do I make it be MakeeBusinessName
or MakeEbusinessName or .... personally, I prefer make-eBusiness-name.

It might even be better to use _'s, but it's a shifted character on most
keyboards, and people with weak fingers hate shifting that often, so hyphens
tend to be preferred.  make_eBusiness_name might otherwise be better, and
would save confusion with minus sign.

[CL uses uppercase as the canonical case for the case-normalized name,
and that's controversial with some people, but some of us like it.  In any
case, it's orthogonal to this other question about case translation.]

In any case, my real point is not to say there's a 100% clear answer here,
but merely to motivate that the choice of case-translation is not archaic
but definitely has support from people who think themselves to be living
in the present.
From: Christopher Browne
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3it7l9pxs.fsf@chvatal.cbbrowne.com>
Centuries ago, Nostradamus foresaw when Kent M Pitman <······@world.std.com> would write:
> Psychology experiments have empirically shown that memory is
> auditory.  That is, when you misremember words, you misremember them
> by soundalike, not by lookalike.  There is also ample linguistic
> evidence that the core of human language is an auditory phenomenon.
> When languages vary, they first change in their spoken form and then
> later writing catches up, not much vice versa.

I agree in part.

The "western" languages certainly are representative of that; our
languages are largely a way of taking what we say and putting it on
paper.  (Computers being an insignificant "blip" thus far in the
history of it :-).)

My understanding of the Asian languages is that they are often _not_
such a representation; what is written is _not_ an account what is
spoken.  Writing is, there, representative of a separate language.  In
more clearly "pictographic" languages, there may _not_ be an auditory
form except as constructed afterwards.

That caveat being given, words don't usually sound different when they
have different casing and aren't usually recognized as being
different.

"That" is not a different word from "that."
-- 
(reverse (concatenate 'string ····················@" "454aa"))
http://www.ntlug.org/~cbbrowne/linux.html
"Of  _course_ it's the murder weapon.   Who would frame someone with a
fake?"
From: Thomas A. Russ
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <ymi1ye8e0h1.fsf@sevak.isi.edu>
Ed L Cashin <·······@uga.edu> writes:
> I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium.  (I know that Lisp
> predates BASIC.  I'm talking about my reaction.)  I'd be happy to hear
> a good case for case-insensitive identifiers.

Point of Information:

  Lisp does not have case-insensitive symbol names.  They are most
certainly case-sensitive.  It is just that the default setting of the
input reader makes it inconvenient to use mixed-case identifiers, since
you need to escape either the lower-case characters

   f\o\o  gives the symbol named "Foo"

or the entire symbol name

   |Foo|  also gives the symbol named "Foo"

The default behavior of the reading process is to map all (non-escaped)
characters to uppercase.

There are ways around this, such as setting the readtable case to
:PRESERVE, which as you might suspect, preserves the input case.  With
that setting one could type Foo and get the symbol named "Foo".  But all
of the built-in Common Lisp symbols are defined to be in uppercase, so
that would mean having to type the built-in symbols all in uppercase.

It so happens that there is a very clever way around this, with
readtable case :INVERT, which inverts the case of all identifiers which
use either only lowercase or only uppercase, but preserves the case of
mixed case identifiers.  This probably gives you the best of both
worlds.
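
For example (a sketch, using a fresh copy of the standard readtable):

   (let ((*readtable* (copy-readtable nil)))
     (setf (readtable-case *readtable*) :invert)
     (list (read-from-string "car")      ; => CAR   (all lowercase inverts up)
           (read-from-string "CAR")      ; => |car| (all uppercase inverts down)
           (read-from-string "Foo")))    ; => |Foo| (mixed case is preserved)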

[Aside: Kent or anyone:  Who came up with the idea for the :INVERT
 readtable case?  It seems rather clever, even if in a slightly demented
 sort of way.]

-Tom.

-- 
Thomas A. Russ,  USC/Information Sciences Institute          ···@isi.edu    
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226021614417921@naggum.net>
* Ed L Cashin <·······@uga.edu>
| Could you elaborate on that a bit?  I'm interested because it appears
| that you're position is that case-sensitivity in identifiers is a Bad
| Thing for programming languages.

  I consider it a bad thing to believe that A is a different character from
  a just because it has a certain "presentation property".  I mean, we do
  not distinguish characters based on font or face, underlining or color,
  and most people realize that these are incidental properties.  However,
  capitalness of a letter is just as incidental: The fact that a letter is
  capitalized depending on such randomness as the position of the word in
  the sentence is a very strong indicator that "However" and "however" are
  not different words, which is effectively what case-sensitive people
  think they are.  I tried to publish text without this incidental property
  for a while, but it seemed to tick people off even more than calling an
  idiot an idiot.

| A general principle of mine is that if things are distinguishable, they
| should not be collapsed but the distinction should be preserved whenever
| possible.  Treating different characters as the same character, or
| treating different character sequences as equivalent, should be postponed
| as long as possible in order to preserve information.

  If you use colors to distinguish keywords from identifiers in your editor,
  can you use a keyword with a different color as an identifier?

| Are you suggesting that this principle is inappropriate to apply to the
| character sequences that compose identifiers in source code?  That would
| mean that "ABLE" is the same identifier as "able".

| I must admit that when I first found out that current lisps have
| case-insensitive symbol names, I thought it reminiscent of BASIC -- kind
| of a throwback to a time when memory was much more at a premium.

  But this is not the case.  The symbol names are case-sensitive, but the
  Common Lisp reader maps all unescaped characters to uppercase by default.
  You can change this.  Symbols are in this fashion just like normal words
  in your natural language.

| (I know that Lisp predates BASIC.  I'm talking about my reaction.)  I'd
| be happy to hear a good case for case-insensitive identifiers.

  I think case sensitivity is an abuse of an incidental property.  Thus, I
  want to hear a good case for case-sensitive identifiers.  Older languages
  did not have this property, but after Unix (which has a case-insensitive
  tty mode!), the norm became to distinguish case, largely because there
  was no other namespace functionality in early C.  Unix also chose to use
  lower-case commands whereas Multics had always supported case-folding.  I
  believe the reason that the Unix people wanted to distinguish case was
  that it would require an extra instruction and a lookup table that would
  waste a precious 128 bytes of memory in the kernel, while we currently
  waste an enormous amount of memory to keep case-folding tables several
  times over.  In my view, case-sensitive identifiers have become the norm
  in a community that has failed to think about proper solutions to their
  problems, but rather chose to solve only the immediate problem, much
  like C strongly encourages irrelevant micro-optimization.  So instead of
  being nice to the user, they were nice to the programmer, who did not
  have to case-fold the incoming identifiers.  I consider moving this
  burden onto the user to be quite user-inimical and actually quite foreign
  to people who do not know the character coding standards.  I mean, do we
  have case-sensitive trademarks, even though we traditionally capitalize
  proper names?  Are Oracle and ORACLE different companies any more than
  ORACLE in red boldface 14 point Times Roman is a different company than
  ORACLE in blue italic 12 point Helvetica?

  There has definitely been a "paradigm shift" in computer people's view on
  case, but not in non-computer people's.  Internet protocols like SMTP use
  case-insensitive commands.  The DNS is case-insensitive.  SGML is
  case-insensitive and so is HTML.  Because of the huge problems we face
  with case-folding Unicode (which must be done with a table of some kind),
  some people have figured that we should _not_ do case-folding.  That is
  the wrong solution to the problem.  The right solution to the problem is
  to get rid of case as a character property.

  Now, assume that we no longer have different character codes for lower-
  case and upper-case letters.  Would there be any difference in how we
  look at text on computer screens, in print, etc?  No, of course not.
  Therefore, people would still be able to distinguish identifiers visually
  based on case if they want to -- just like the Common Lisp reader allows
  you to write |car| to refer to the symbol named "car", and |CAR| to refer
  to the symbol named "CAR", and just like Unix can deal with upper- and
  lower-case letters even when iuclc and olcuc are in effect with the xcase
  option by backslashing the real uppercase characters in your input.  (In
  Common Lisp, you would backslash a lower-case character in the default
  reader mode, and the printer will escape those characters that should not
  be case-folded.)  However, being able to do something and actually doing
  it are two very different things.  E.g., on TOPS-20, you could use
  lower-case letters in filenames if you really wanted to, by prefixing
  them with ^V.  Very few people bothered to do this because typing it in
  was a hassle.  I do not propose any change to how we input upper and
  lower case, but with the anal-retentive approach to saving bits, which
  has even gone so far as to write FooBarZot instead of foo-bar-zot, the
  probability that the C freaks would have chosen case-sensitivity would be
  remarkably lower -- if we could go back and design the world over...
  
///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3adswokg3.fsf@hanabi.research.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

> * Ed L Cashin <·······@uga.edu>
> | Could you elaborate on that a bit?  I'm interested because it appears
> | that you're position is that case-sensitivity in identifiers is a Bad
> | Thing for programming languages.
> 
>   I consider it a bad thing to believe that A is a different character from
>   a just because it has a certain "presentation property".  I mean, we do
>   not distinguish characters based on font or face, underlining or color,
>   and most people realize that these are incidental properties.  However,
>   capitalness of a letter is just as incidental: The fact that a letter is
>   capitalized depending on such randomness as the position of the word in
>   the sentence is a very strong indicator that "However" and "however" are
>   not different words, which is effectively what case-sensitive people
>   think they are.

This is not strictly true in all (natural) languages.

Example 1: German:
   - no 1-1 correspondence between upper-case and lower-case (there is one
     letter that only exists in the lower-case set)
   - some words change class, meaning, and pronunciation when going from
     one case to the other (example: Weg vs. weg)
   - case is used (or at least has been -- until it became non-pc in some
     circles) to put semantic fine points into print (e.g., capitalization of
     the second person in letters for politeness)

Example 2: Japanese
   - there is no distinction between upper-case and lower-case at all
   - HOWEVER: there are still two distinct sets of the phonetic characters
     called "hiragana" and "katakana".  Either one could spell the entire
     language, but usage of the two sets again depends on things like
     origin of the word in question, emphasis, style, etc.
     One could think of katakana as the upper-case version of hiragana.
     Usage is often analogous, for example one would sometimes find
     hiragana words spelled in katakana for EMPHASIS.
   - Written Japanese also uses kanji (Chinese characters), all of which could
     be spelled either in hiragana or katakana.  Unfortunately, the mapping
     between kanji and hiragana is many-to-many, which shows that the "is the
     same word" relationship is not an equivalence relation because it is
     not transitive:  "hashi" (chopsticks) and "hashi" (bridge) are spelled
     exactly the same in hiragana (but are pronounced slightly differently),
     but the kanji for the respective words are not the same.  OTOH, "kyou"
     and "konnichi" are clearly not the same words when spelled phonetically,
     but both correspond to the same kanji combination.  There are literally
     thousands of examples for this in Japanese (which does not make it particularly
     easy to learn :-).

Example 3: English
   - Speaking of "him" and speaking of "Him" are clearly semantically very different.

Example 4: Mathematics  (well, this one is not "natural", after all...)
   - In the "language of mathematics" we frequently make semantic distinctions
     between typographically different versions of the "same" character.

Anyway, all I wanted to say was that the distinction between different
versions of a character set are not completely incidental in many
(most?) natural languages.  I do not want to use this as as argument
for or against case-sensitive identifiers in programming languages,
since I do not think that programming languages should in any form or
manner be modelled after natural ones.  (However, I must admit that I
personally prefer being able to use mixed case when programming.)

Matthias
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226054464281011@naggum.net>
* Matthias Blume <········@shimizu-blume.com>
| This is not strictly true in all (natural) languages.

  All of these arguments indicate that using the capital letter for the
  sentence-initial word is a very bad design choice for a written language;
  it violates that strong sense of difference that those who want it to
  exist focus so strongly on.  However, I would argue that the sheer
  acceptability of destroying the importance of the capital letter in the
  sentence-intiial word cannot be ignored.  When I tried to _preserve_ the
  case of the word despite its position in the sentence, this was regarded
  as Very Wrong by a bunch of hostile lunatics.  This indicated to me that
  case is _primarily_ incidental, since the intrinsic role can at any time
  be overridden by the incidental role -- specifically, you have no idea
  whatsoever what the capitalization of the sentence-initial word would be
  if it were moved, yet this causes absolutely no problem for anyone.

| Anyway, all I wanted to say was that the distinction between different
| versions of a character set are not completely incidental in many (most?)
| natural languages.

  In real life, nothing is ever completely anything.  People use and abuse
  case "because it's there".  This would not change if capital letters were
  coded with a "flag" that communicated capitalness.  On the contrary, if
  we had such a flag, the natural development is to have _two_ flags: One
  for the incidental capital and one for the intrinsic capital.  In either
  case, the display and the coding properties of a character should be
  separated.  You provided an excellent example of this with hiragana and
  katakana.

| I do not want to use this as as argument for or against case-sensitive
| identifiers in programming languages, since I do not think that
| programming languages should in any form or manner be modelled after
| natural ones.

  That is not the argument.  Please try to understand this.  The point is
  that I have taken the liberty to design the world over again, backing up
  to _before_ computer geeks coded their character sets, and making a
  crucial change to the coding of upper-case vs lower-case characters.  The
  names "upper-case" and "lower-case" refer to typographic characteristics,
  not meaning.  Meaning may be coded separately from typography, just as we
  do in almost every other case.

| (However, I must admit that I personally prefer being able to use mixed
| case when programming.)

  If it had been more costly for you to achieve this, in terms of "knowing"
  that you would waste additional space to encode capital letters, would
  you still have preferred it?  I believe, from the reactions to the
  extended experiment with not randomly upcasing the sentence-initial word,
  that people would be inclined to accept a coding overhead for that role,
  as well as for proper nouns, but randomly and liberally sprinkling such
  overhead throughout identifiers in order to achieve an unnatural visual
  effect only because it could be done, would most likely not happen.  As
  Common Lisp uses the hyphen to separate words, which would have no higher
  overhead than embedded capital letters, other languages would have far
  less inclination to make this horrible mistake, and would therefore not
  _require_ case-sensitivity.

  Whether the programmers would prefer a case-folding or a case-preserving
  case-insensitivity is an open question, but at least designing languages
  and coding conventions to use case would not likely happen if case was
  regarded as just as incidental as color or typeface.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <folmcgejul.fsf@trex10.cs.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

>   [ ... ] The point is
>   that I have taken the liberty to design the world over again [...]

Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226061533844203@naggum.net>
* Matthias Blume <········@shimizu-blume.com>
| Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)

  Yeah, me too.  Then I could force you to pay attention to the premises
  that start a discussion instead of completely ignoring the context.
  Please see <················@naggum.net>, and pay particular attention to
  what Thomas Bushnell wrote.

  Sheesh, some people.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Back to character set implementation thinking
Date: 
Message-ID: <87ofhczdat.fsf_-_@becket.becket.net>
Erik Naggum <····@naggum.net> writes:

>   Yeah, me too.  Then I could force you to pay attention to the premises
>   that start a discussion instead of completely ignoring the context.
>   Please see <················@naggum.net>, and pay particular attention to
>   what Thomas Bushnell wrote.

So, getting back to my original question about charset implementations
in Lisp/Scheme (though actually Smalltalk or any such
dynamically-typed language will have the same questions and probably
the same kinds of solutions), I've done some more study and thinking,
so let me try again.  My previous question was a tad innocent, it
appears, because I was unaware of the great changes that have taken
place in Unicode since the last time I read through it and grokked the
whole thing (which was back at version 1.2 or something).  

I haven't fully internalized the terminology yet, though I'm trying.
So please bear with any minor terminological gaffes (and correct them,
too).  

The GNU/Linux world is rapidly converging on using UTF-8 to hold
31-bit Unicode values.  Part of the reason it does this is so that
existing byte streams of Latin-1 characters can (pretty much) be used
without modification, and it allows "soft conversion" of existing
code, which is quite easy and thus helps everybody switch.

But I'm thinking about a "design the world over again" kind of
strategy.  Now Erik is certainly right that capitalization *should* be
a combining character kind of thing.  So let me stipulate that I want
to take Unicode as-is; I get to design *my computer system*, subject
to the a priori constraint that Unicode has done a *lot* of work, so I
will accept slight deficiencies if they help Unicode work right on the
system.  So I'll take the existing Unicode encodings, even if they
don't do capitals just like we'd want.

But I don't get to redesign existing communications protocols and
such; however, that's an externalization issue, and for internal use
on the system, such protocols don't matter.  Similar comments apply
for existing filesystems formats, file conventions, and the like.

Now, I *could* just use UTF-8 internally, but that seems rather
foolish.  I think it's obvious that characters should be "immediately"
represented in pointer values in the way that fixnums are.

Now the Universal Character Set is officially 31 bits, but only 16
bits are in use now, and it is expected that at most 21 bits will be
used.  So that means it's pretty easy to make sure the whole space of
UCS values fits in an immediate representation.  That's fine for
working with actively used data.
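
To make that concrete, here is a minimal sketch of the packing, in
Common Lisp; the 3-bit tag value and the helper names are invented for
illustration and are not any particular implementation's scheme:

;; Sketch only: pack a UCS code point above a hypothetical 3-bit type tag.
(defconstant +char-tag+ #b101)          ; invented tag value
(defconstant +tag-width+ 3)

(defun make-immediate-char (code-point)
  "Pack a code point of up to 21 bits into a tagged immediate word."
  (dpb code-point (byte 21 +tag-width+) +char-tag+))

(defun immediate-char-code (immediate)
  "Recover the code point from the tagged immediate."
  (ldb (byte 21 +tag-width+) immediate))

;; (immediate-char-code (make-immediate-char #x10FFFF)) => #x10FFFF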

However, strings that are going to be kept around a long time should,
it seems to me, be stored more compactly.  Essentially all strings
will be in the Basic Multilingual Plane, so they can fit in 16 bits.
That means there would be two underlying string datatypes.  I don't
think this is a serious problem.  Is it worth having a third (for
8-bit characters) so that Latin-1 files don't have to be inflated by a
factor of two?  It seems to me that this would be important too.
Basically then we would have strings which are UCS-4, UCS-2 and
Latin-1 restricted (internally, not visibly to users).

So even if strings are "compressed" this way, they are not UTF-8.
That's Right Out.  They are just direct UCS values.  Procedures like
string-set! therefore might have to inflate (and thus copy) the entire
string if a value outside the range is stored.  But that's ok with me;
I don't think it's a serious lose.
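
Concretely, the inflate-on-store behaviour might look like the sketch
below, written as a wrapper structure over specialized vectors of
character codes; the names are invented, and a real implementation
would of course do this below the level of user code:

;; Sketch: a string wrapper that widens its storage when a stored
;; character does not fit the current element type.
(defstruct (adaptive-string (:constructor %make-adaptive-string (data)))
  data)                                 ; underlying vector of character codes

(defun make-adaptive-string (length)
  ;; start with the narrowest representation (the Latin-1 range)
  (%make-adaptive-string
   (make-array length :element-type '(unsigned-byte 8) :initial-element 0)))

(defun astring-ref (astring index)
  (code-char (aref (adaptive-string-data astring) index)))

(defun astring-set (astring index char)
  (let ((code (char-code char))
        (data (adaptive-string-data astring)))
    ;; inflate (and copy) when the new code does not fit the current width
    (unless (typep code (array-element-type data))
      (setf data (make-array (length data)
                             :element-type (if (< code #x10000)
                                               '(unsigned-byte 16)  ; UCS-2
                                               '(unsigned-byte 32)) ; UCS-4
                             :initial-contents data)
            (adaptive-string-data astring) data))
    (setf (aref data index) code)
    char))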

So is this sane?

Ok, then the second question is about combining characters.  Level 1
support is really not appropriate here.  It would be nice to support
Level 3.  But perhaps Level 2 with Hangul Jamo characters [are those
required for Level 2?] would be good enough.

It seems to me that it's most appropriate to use Normalization Form
D.  Or is that crazy?  It has the advantage of holding all the Level 3
values in a consistent way.  (Since precombined characters do not
exist for all possibilities, Normalization Form C results in some
characters precombined and some not, right?)
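
As a concrete illustration of that asymmetry, the familiar letter e
with acute accent does have a precombined code point:

;; The two normalization forms of e-acute, as lists of code points:
(defparameter *e-acute-nfc* '(#x00E9))          ; U+00E9, precombined
(defparameter *e-acute-nfd* '(#x0065 #x0301))   ; e + COMBINING ACUTE ACCENT

whereas a base letter carrying a diacritic that has no precombined
character stays a two-code-point sequence in both forms.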

And finally, should the Lisp/Scheme "character" data type refer to a
single UCS code point, or should it refer to a base character together
with all the combining characters that are attached to it?

Thomas
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226095271716329@naggum.net>
* Thomas Bushnell, BSG
| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values.  Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

  UTF-8 is in fact extremely hostile to applications that would otherwise
  have dealt with ISO 8859-1.  The addition of a prefix byte has some very
  serious implications.  UTF-8 is an inefficient and stupid format that
  should never have been proposed.  However, it has computational elegance
  in that it is a stateless encoding.  I maintain that encoding is stateful
  regardless of whether it is made explicit or not.  I therefore strongly
  suggest that serious users of Unicode employ the compression scheme that
  has been described in Unicode Technical Report #6.  I recommend reading
  this technical report.

  Incidentally, if I could design things all over again, I would most
  probably have used a pure 16-bit character set from the get-go.  None of
  this annoying 7- or 8-bit stuff.  Well, actually, I would have opted for
  more than 16-bit units -- it is way too small.  I think I would have
  wanted the smallest storage unit of a computer to be 20 bits wide.  That
  would have allowed addressing of 4G of today's bytes with only 20 bits.
  But I digress...

| So even if strings are "compressed" this way, they are not UTF-8.  That's
| Right Out.  They are just direct UCS values.  Procedures like string-set!
| therefore might have to inflate (and thus copy) the entire string if a
| value outside the range is stored.  But that's ok with me; I don't think
| it's a serious lose.

  There is some value to the C/Unix concept of a string as a small stream.
  Most parsing of strings proceeds from start to end, so there is no
  point in optimizing them for direct access.  However, a string would
  then be different from a vector of characters.  It would, conceptually,
  be more like a list of characters, but with a more compact encoding, of
  course.  Emacs MULE, with all its horrible faults, has taken a stream
  approach to character sequences and then added direct access into it,
  which has become amazingly expensive.

  I believe that trying to make "string" both a stream and a vector at the
  same time is futile and only leads to very serious problems.  The default
  representation of a string should be a stream, not a vector, and accessors
  should use the stream, such as with make-string-{input,output}-stream,
  with new operators like dostring, instead of trying to use the string as
  a vector when it clearly is not.  The character concept needs to be able
  to accommodate this, too.  Such pervasive changes are of course not free.
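
  A minimal sketch of such a dostring, layered over the existing
  string-stream machinery (this expansion is only one obvious way to
  write it, not a specification):

  (defmacro dostring ((char string &optional result) &body body)
    ;; Sketch: iterate CHAR over STRING via a string input stream, so the
    ;; caller never indexes into the string directly.
    (let ((stream (gensym "STREAM")))
      `(with-input-from-string (,stream ,string)
         (loop for ,char = (read-char ,stream nil nil)
               while ,char
               do (progn ,@body)
               finally (return ,result)))))

  ;; (dostring (c "foo") (write-char (char-upcase c))) prints FOO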

| Ok, then the second question is about combining characters.  Level 1
| support is really not appropriate here.  It would be nice to support
| Level 3.  But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

  Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

  I agree for the streams approach.  I think it is important to make sure
  that there is a single code for all character sequences in the stream
  when it is converted to a vector.  The private use space should be used
  for these things, and a mapping to and from character sequences should be
  maintained such that if a private use character is queried for its
  properties, those of the character sequence would be returned.
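
  A sketch of the bookkeeping that implies, with the table names invented
  for illustration:

  ;; Sketch: interning combining sequences as private-use code points so a
  ;; vector representation can stay one code per element.
  (defparameter *sequence->private* (make-hash-table :test #'equal))
  (defparameter *private->sequence* (make-hash-table :test #'eql))
  (defparameter *next-private-code* #xE000)   ; start of the BMP private use area

  (defun intern-combining-sequence (codes)
    "Return one code point standing for the list of code points CODES."
    (or (gethash codes *sequence->private*)
        (let ((code *next-private-code*))
          (incf *next-private-code*)
          (setf (gethash codes *sequence->private*) code
                (gethash code *private->sequence*) codes)
          code)))

  (defun private-code-sequence (code)
    "The underlying sequence, so property queries can be forwarded to it."
    (gethash code *private->sequence*))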

| Or is that crazy?  It has the advantage of holding all the Level 3 values
| in a consistent way.  (Since precombined characters do not exist for all
| possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

  Correct.

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

  Primarily the code point, but both, effectively, by using the private use
  space as outlined above.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Christopher Browne
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <m31ye857kf.fsf@chvatal.cbbrowne.com>
The world rejoiced as Erik Naggum <····@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
> | Unicode values.  Part of the reason it does this is so that existing byte
> | streams of Latin-1 characters can (pretty much) be used without
> | modification, and it allows "soft conversion" of existing code, which is
> | quite easy and thus helps everybody switch.
>
>   UTF-8 is in fact extremely hostile to applications that would otherwise
>   have dealt with ISO 8859-1.  The addition of a prefix byte has some very
>   serious implications.  UTF-8 is an inefficient and stupid format that
>   should never have been proposed.  However, it has computational elegance
>   in that it is a stateless encoding.  I maintain that encoding is stateful
>   regardless of whether it is made explicit or not.  I therefore strongly
>   suggest that serious users of Unicode employ the compression scheme that
>   has been described in Unicode Technical Report #6.  I recommend reading
>   this technical report.
>
>   Incidentally, if I could design things all over again, I would most
>   probably have used a pure 16-bit character set from the get-go.  None of
>   this annoying 7- or 8-bit stuff.  Well, actually, I would have opted for
>   more than 16-bit units -- it is way too small.  I think I would have
>   wanted the smallest storage unit of a computer to be 20 bits wide.  That
>   would have allowed addressing of 4G of today's bytes with only 20 bits.
>   But I digress...

You should have a chat with Charles Moore, of Forth fame.  He
designed, using a CAD system he wrote in Forth, called OK, a 20 bit
microprocessor that (surprise, surprise...  NOT!) has an instruction
set designed specifically for Forth.

Something that is unfortunate is that the 36 bit processors basically
died off in favor of 32 bit ones.  Which means we have great gobs of
algorithms that assume 32 bit word sizes, with the only leap anyone
can conceive of being to 64 bits, and meaning that if you need a tag
bit or two for this or that, 32 bit operations wind up Sucking Bad.

But I digress, too...
-- 
(concatenate 'string "cbbrowne" ·@ntlug.org")
http://www.ntlug.org/~cbbrowne/oses.html
Rules of the  Evil Overlord #230. "I will  not procrastinate regarding
any ritual granting immortality."  <http://www.eviloverlord.com/>
From: cr88192
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <ua00sjlqnnv996@corp.supernews.com>
> 
> Something that is unfortunate is that the 36 bit processors basically
> died off in favor of 32 bit ones.  Which means we have great gobs of
> algorithms that assume 32 bit word sizes, with the only leap anyone
> can conceive of being to 64 bits, and meaning that if you need a tag
> bit or two for this or that, 32 bit operations wind up Sucking Bad.
> 
hello, personally I don't really know what the big difference is...
I would have imagined that in any case a slightly larger word size would 
have been useful, but it is not...
sometimes for some of my code I use 48 bit ints (when 32 bits is too small 
and 64 is overkill). I would think that with 36 bits the next size up would 
be 72, and 36 is not evenly divisible by 8 so you would need a different 
byte size as well (ie: 9 or 12).
sorry, I don't really know of byte sizes other than 8...
am I missing something?

(little has changed in my life since before, except that I am working on an 
os now... again...).
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226111629531727@naggum.net>
* cr88192 <·······@hotmail.com>
| sorry, I don't really know of byte sizes other than 8...
| am I missing something?

  Yes.  A "byte" is only a contiguous sequence of bits in a machine word,
  and has been used that way by most vendors, for us notably DEC, which
  contributed the machine instructions we know as LDB and DPB and the
  notion of a byte specifier, which has bit position in word and length in
  bits.  Failure to support LDB and DPB in hardware is very costly for a
  large number of useful operations, but in a byte-addressable world
  with 8-bit bytes, using anything smaller than bytes that might cross byte
  boundaries has serious penalties.  In a word-addressable world, this
  saves a lot of memory, even relative to the byte-addressable machines.  C
  has bit fields because it was intended to run on the Honeywell 6000, which
  had 36-bit words, so its "char" was 9 bits wide.  (See page 34 of
  Kernighan & Ritchie, 1st ed.)

  IBM chose a more specific terminology: 4-bit nybbles (the same spelling
  deviation as "byte" from "bite"), 8-bit bytes, 16-bit half-words, 32-bit
  words, and 64-bit double-words.  On the PDP-10, we had 36-bit words,
  18-bit half-words (and halfword instructions), but bytes were all over
  the place.  I know several people who think this is a much better design
  than the stupid 8-bit design we have today.  Sadly, only several, not
  millions and millions who think Intel's designs are better just because
  they can buy them.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Pekka P. Pirinen
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <usn6kh477.fsf@globalgraphics.com>
·········@becket.net (Thomas Bushnell, BSG) writes:
> So, getting back to my original question about charset implementations
> in Lisp/Scheme (though actually Smalltalk or any such
> [much snippage]
> So that means it's pretty easy to make sure the whole space of
> UCS values fits in an immediate representation.  That's fine for
> working with actively used data.

Even for actively used data, compactness of representation pays off in
better cache efficiency.  In fact, it is particularly for actively used
data that we should be mindful of this.  Since you seem to be thinking of a
32-bit immediate representation, an improvement to 16-bit strings or
even 8-bit strings is nothing to be sneezed at.

> However, strings that are going to be kept around a long time should,
> it seems to me, be stored more compactly.  Essentially all strings
> will be in the Basic Multilingual Plane, so they can fit in 16 bits.
> That means there would be two underlying string datatypes.  I don't
> think this is a serious problem.

As an implementor, I can tell you that actually the step from one
string type to two is the hardest bit.  Once you've figured out how
you want to implement that, having more is not such a big deal.  From
a programmer's point of view, the efficiency gains from more string
types outweigh the costs (unless you think you could do without the
larger ones), even if you have to deal with them explicitly.

> Is it worth having a third (for 8-bit characters) so that Latin-1
> files don't have to be inflated by a factor of two?  It seems to me
> that this would be important too.

Files and strings don't really have much to do with each other.  Files
are an externalization issue.  Of course you can store files in UCS,
and sometimes that's the right thing to do, but in the real world, you
have to deal with all kinds of encodings, so you need the machinery,
anyway, to read and write Shift-JIS, Big5, Latin-1, UTF-8, etc.

Like I said above, it _is_ important to have an 8-bit string type.
People in the West, who rarely even realize they could easily support
16-bit users, will get great benefits.  And between files and
"actively used data", there are those people who want to load their
entire database in main memory and compute with that; they'll get
their size limit extended as well.

> Basically then we would have strings which are UCS-4, UCS-2 and
> Latin-1 restricted (internally, not visibly to users). [...]
> Procedures like string-set! therefore might have to inflate (and
> thus copy) the entire string if a value outside the range is stored.
> But that's ok with me; I don't think it's a serious lose.

I suppose that is a viable implementation strategy, but I don't think
it's the right option.  The language should expose the range of string
data types to the programmer, and let them choose, because the range
of memory usage is just too great to sweep under the mat.  Also,
having strings automatically reallocated means an extra indirection
for access which cannot always be optimized away.

I note that offering multiple string types is exactly what all the CL
implementations seem to have done.  This doesn't preclude having
features that automatically select the smallest feasible type, e.g.,
for "" read syntax or a STRING-APPEND function.
-- 
Pekka P. Pirinen
The gap between theory and practice is bigger in practice than in theory.
From: Thomas Bushnell, BSG
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <878z8cv4jw.fsf@becket.becket.net>
···············@globalgraphics.com (Pekka P. Pirinen) writes:

> > Is it worth having a third (for 8-bit characters) so that Latin-1
> > files don't have to be inflated by a factor of two?  It seems to me
> > that this would be important too.
> 
> Files and strings don't really have much to do with each other.  Files
> are an externalization issue.  Of course you can store files in UCS,
> and sometimes that's the right thing to do, but in the real world, you
> have to deal with all kinds of encodings, so you need the machinery,
> anyway, to read and write Shift-JIS, Big5, Latin-1, UTF-8, etc.

In the system I'm contemplating, there are no files in the normal
sense of the term; all user data lives as strings, more or less (there
might be something more clever, but whatever).  Whatever strategies are
used for strings (and similar structures) will be important for all
files.

So such data has to be efficiently stored...

> I note that offering multiple string types is exactly what all the CL
> implementations seem to have done.  This doesn't preclude having
> features that automatically select the smallest feasible type, e.g.,
> for "" read syntax or a STRING-APPEND function.

But this is, it seems to me, unclean.

I think of it as being similar to the way numbers work.  Yes, I can
find out whether a given number is a fixnum or a bignum, and I might
well care in some special case.  But normally I just use numbers and
expect the system to automagically do the right thing.

Similarly, I want the string type to simply encode Unicode strings,
and the user should not be forced to deal with more.  The user should
not need to guess at the time the string is created whether or not it
will later need to hold a bigger character code, for example.

Thomas
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0203290119.4152e59f@posting.google.com>
·········@becket.net (Thomas Bushnell, BSG) wrote in message 
> Similarly, I want the string type to simply encode Unicode strings,
> and the user should not be forced to deal with more.  The user should
> not need to guess at the time the string is created whether or not it
> will later need to hold a bigger character code, for example.
> 

I think you need to differentiate between mutable and immutable
strings.

A mutable string which is not explicitly restricted (such as
simple-base-string) needs to be able to hold any character, so it
needs to be conservative.

An immutable string cannot be modified, so you are free to encode it
however you like, as long as you can represent whatever you have it
in.

The remainder of the problem is the idea of strings as vectors rather
than sequences; as sequences, O(1) access is no longer an issue
(although you'd want better iteration support than CL currently
provides).

Beyond this it should be trivial to have an immutable string type
which knows what encoding it is using, and can tell the system what
accessor to use.

As a side-note, string literals and the names of symbols are immutable
in CL.

In addition you would need an operator to encode a mutable string as
an immutable string (using a given encoding); options for immutable
construction for subseq, concatenate, string-output-stream, etc. would
also be useful.

Regards,

Brian
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226402339496495@naggum.net>
* Brian Spilsbury
| I think you need to differentiate between mutable and immutable
| strings.

  I have suggested that strings need to be separated into two more basic
  types: a stream which you read one element at a time, and a vector which
  provides random access.  The former maps directly to files and is
  suitable for parsing and formatting, while a vector of characters is more
  useful for repeated access to the same characters.

  We have the system class string-stream today, which offers stream access
  to a string, but I think we need a subclass of string like stream-string,
  which may contain such things as the octets from another stream (e.g.
  directly from an input file) and be processed sequentially, and therefore
  should also be able to use stateful encodings such that reading through
  them with the string-stream functions would maintain that state.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0203300042.501de3ad@posting.google.com>
Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> * Brian Spilsbury
> | I think you need to differentiate between mutable and immutable
> | strings.
> 
>   I have suggested that strings need to be separated into two mor basic
>   types: a stream which you read one element at a time, and a vector which
>   provides random access.  The former maps directly to files and is
>   suitable for parsing and formatting, while a vector of characters is more
>   useful for repeated access to the same characters.
> 
>   We have the system class string-stream today, which offers stream access
>   to a string, but I think we need a subclass of string like stream-string,
>   which may contain such things as the octets from another stream such as
>   directly from an input file, and be processed sequentially, and therefore
>   should also be able to use stateful encodings such that reading through
>   them with the string-stream functions would maintain that state.

I think that this approach separates things which do not require it.

If we view a string as a sequence rather than a vector, I believe that
most of these problems evaporate.

A sequence contains things which have both
vector-access-characteristics and list-access-characteristics.

The problem is that sequences in CL have relatively poor iteration
support.

One of the more complex things that we might want to do with a string
is to tokenise it.

(let ((last-point nil))
  (dosequence (char point string)
    (when (char= char #\,)
      (if last-point
          (collect (subseq string :start-point last-point :end-point point))
          (setq last-point point)))))

for a half-baked example, to break up a string into a list of comma
delimited strings.
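
For comparison, the same comma-splitting written against today's
vector-strings, using positions instead of points (a sketch only):

;; Plain-CL comparison: split on commas using positions.
(defun split-on-commas (string)
  (loop for start = 0 then (1+ end)
        for end = (position #\, string :start start)
        collect (subseq string start end)
        while end))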

The key here is the ability to access a sequence from a stored point
in the sequence, and to use these points to delimit sequence actions.

Given this a string can easily have either kind of substrate - a
random access, or linear access implementation, and this behaviour
extends naturally to lists.

There are some issues with points and the mutation of the string, as
well as the usable life-time of the points, but I think that these can
be addressed with some thought.

This also does not preclude the (expensive) random access of a
variable-width character string, and would also tie into the lazy
construction of sequences (whereby you might deal with a file as a
lazy sequence, something like a lisp version of mmap).

Anyhow, given that variable-width-character strings would tend to be
immutable (or perhaps extensible and truncatable) points should have
few problems there. I don't see any issues with points into lists
either.

Regards,

Brian
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226482787784866@naggum.net>
* Brian Spilsbury
| I think that this approach separates things which do not require it.
| 
| If we view a string as a sequence rather than a vector, I believe that
| most of these problems evaporate.

  I think we have a terminological problem here.  What you call a sequence
  is not the Common Lisp concept of "sequence", since list, string, and
  vector are all sequences.  I think you mean something very close to what I
  mean by stream-string with your non-Common Lisp "sequence" concept.

| A sequence contains things which have both vector-access-characteristics
| and list-access-characteristics.

  This would also be a new invention, because this is currently foreign to
  Common Lisp.  What I _think_ you mean is very close to what I have tried
  to explain in (more) Common Lisp terminology.

| The problem is that sequences in CL have relatively poor iteration
| support.

  Well, there is nothing in Common Lisp that has both O(1) and O(n) access
  characteristics, and nothing in Common Lisp that has both support for
  random access and sequential access.  I propose that stream-string
  support sequential access and string remaining the random access.

| One of the more complex things that we might want to do with a string is
| to tokenise it.

  Precisely, but this is a problem that has many different kinds of
  solutions, not just one.

| (let ((last-point nil))
|   (dosequence (char point string)
|     (when (char= char #\,)
|        (if last-point
|            (collect (subseq string :start-point last-point :end-point
| point))
|            (setq last-point point)))))
| 
| for a half-baked example, to break up a string into a list of comma
| delimited strings.

  I prefer a design that has an opaque mark in a stream-string iterator,
  but this should also be in regular streams.  Extracting the string
  between mark and point (in Emacs terminology) may re-establish some
  context in the new string if it is merely a sub-stream-string, but could
  also copy characters into a string (vector).

| The key here is the ability to access a sequence from a stored point in
| the sequence, and to use these points to delimit sequence actions.

  I think the key is that you do not want the string itself to know
  anything about how it is being read sequentially, but a simple pointer
  into the string is not enough.  (C has certainly shown us the folly of
  such a design.)  Specifically, I want a stream-string to be processed
  both with read-byte and read-char.

| Given this a string can easily have either kind of substrate - a random
| access, or linear access implementation, and this behaviour extends
| naturally to lists.

  Well, I have implemented a few processors for weird and stateful
  encodings, and I can tell you that it is not easily done.

| This also does not preclude the (expensive) random access of a
| variable-width character string, and would also tie into the lazy
| construction of sequences (whereby you might deal with a file as a
| lazy sequence, something like a lisp version of mmap).

  I think random access into a variable-width string is simply wrong, like
  using nth to do more than grab exactly one element of a list.

| Anyhow, given that variable-width-character strings would tend to be
| immutable (or perhaps extensible and truncatable) points should have few
| problems there.  I don't see any issues with points into lists either.

  Except that you generally need quite a lot of state, which a stream
  implementation would be fully able to support for you.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0203301055.67c2fe85@posting.google.com>
Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> * Brian Spilsbury
> | I think that this approach separates things which do not require it.
> | 
> | If we view a string as a sequence rather than a vector, I believe that
> | most of these problems evaporate.
> 
>   I think we have a terminological problem here.  What you call a sequence
>   is not the Common Lisp concept of "sequence" since all of list, string,
>   vector are sequences.  I think you mean something very close to what I
>   mean by stream-string with your non-Common Lisp "sequence" concept.

My point is that string is defined as vector in CL.

It is only due to being a vector that a string is a sequence.

A string cannot use a non-vector substrate in CL; if it were
fundamentally a sequence, then it could, as long as that substrate
satisfied sequence.

(although, from memory, vectors are not necessarily O(1) random access
in CL, so you might produce such a primitive type as a kind of vector,
except that vector types don't have the expressivity for noting
encodings, etc...)
 
> | A sequence contains things which have both vector-access-characteristics
> | and list-access-characteristics.
> 
>   This would also a new invention because this is currently foreign to
>   Common Lisp.  What I _think_ you mean is very close to what I have tried
>   to explain in (more) Common Lisp terminology.

I think the issue here is the distinction between a primitive
data-type in CL and a type-definition.

When I say sequence, I mean the type-definition, rather than a
particular data-type.

> | The problem is that sequences in CL have relatively poor iteration
> | support.
> 
>   Well, there is nothing in Common Lisp that has both O(1) and O(n) access
>   characteristics, and nothing in Common Lisp that has both support for
>   random access and sequential access.  I propose that stream-string
>   support sequential access and string remaining the random access.

Lists have support for random access implemented via sequential
accessors.
Vectors have support for linear access implemented via random
accessors.

I don't see a problem with providing a unified interface which at
least brings continuing iteration from saved positions to O(1) [which
would include simply fetching the value at that point, although that
doesn't seem very useful].

The real problem is that sequence doesn't define any iterative
operators, only cons [as list] does via cdr/rest and dolist, and the
ad-hoc support via loop.

> | One of the more complex things that we might want to do with a string is
> | to tokenise it.
> 
>   Precisely, but this is a problem that has many different kinds of
>   solutions, not just one.
> 
> | (let ((last-point nil))
> |   (dosequence (char point string)
> |     (when (char= char #\,)
> |        (if last-point
> |            (collect (subseq string :start-point last-point :end-point
> | point))
> |            (setq last-point point)))))
> | 
> | for a half-baked example, to break up a string into a list of comma
> | delimited strings.
> 
>   I prefer a design that has an opaque mark in a stream-string iterator,
>   but this should also be in regular streams.  Extracting the string
>   between mark and point (in Emacs terminology) may re-establish some
>   context in the new string if it is merely a sub-stream-string, but could
>   also copy characters into a string (vector).

I do not think that limiting yourself to a single mark/point pair, nor
keeping a mark/point in the container, where any modification
propagates side-effects, is a particularly good strategy for lisp.

I think that this makes sense for a Text-Buffer type object (which is
what emacs uses that approach for), though. A Stream interface to a
Text-Buffer would make perfect sense imho.
 
> | The key here is the ability to access a sequence from a stored point in
> | the sequence, and to use these points to delimit sequence actions.
> 
>   I think the key is that you do not want the string itself to know
>   anything about how it is being read sequentially, but a simple pointer
>   into the string is not enough.  (C has certainly shown us the folly of
>   such a design.)  Specifically, I want a stream-string ot be processed
>   both with read-byte and read-char.

I don't think that this is particularly relevant to strings, although
for a string-stream, certainly.

> | Given this a string can easily have either kind of substrate - a random
> | access, or linear access implementation, and this behaviour extends
> | naturally to lists.
> 
>   Well, I have implemented a few processors for weird and stateful
>   encodings, and I can tell you that it is not easily done.

I think it is relatively straightforward, in some encodings the amount
of state might be annoyingly large, though.

In UTF-8, euc-kr, euc-jp, etc there is no state to be saved except for
the octet-position.

In the standard compression scheme for unicode you need to save
Single-Byte-Mode-P, Current-Window, and the 8 Dynamic-Window-Offsets,
and Locking-Shift-P, I've only glanced over the spec, so please excuse
omission or error.

The Unicode SCS is pretty heavy on state, I'll agree, that's 11 words
in the most conservative form, although there are various
optimisations you could apply, I might expect to represent that in 5
32-bit words with packing.
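
Captured as a structure, purely for illustration (the slot names are
invented, and the actual initial window positions come from the tables
in UTR #6):

;; Sketch of the per-iterator state for the standard compression scheme.
(defstruct scsu-state
  (single-byte-mode-p t)            ; single-byte mode vs Unicode mode
  (locking-shift-p    nil)          ; whether a locking shift is in effect
  (current-window     0)            ; index of the active dynamic window
  (dynamic-window-offsets           ; the 8 dynamic window positions
   (make-array 8 :element-type '(unsigned-byte 32) :initial-element 0)))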

The other advantage is that we don't need to store the state in the
string at all, the transitory state is kept in the iterator (ie,
dosequence, map, subseq, etc), and this means that we can share the
string freely between readers, as we currently expect to be able to.

> | This also does not preclude the (expensive) random access of a
> | variable-width character string, and would also tie into the lazy
> | construction of sequences (whereby you might deal with a file as a
> | lazy sequence, something like a lisp version of mmap).
> 
>   I think random access into a variable-width string is simply wrong, like
>   using nth to do more than grab exactly one element of a list.
> 
> | Anyhow, given that variable-width-character strings would tend to be
> | immutable (or perhaps extensible and truncatable) points should have few
> | problems there.  I don't see any issues with points into lists either.
> 
>   Except that you generally need quite a lot of state, which a stream
>   implementation would be fully able to support for you.

I think that a lot of state is the exception rather than the rule.

I also think that as shown above, we can externalise that state into
points, at an acceptable cost for reasonable encodings.

Better sequence iteration support might also facilitate a general
sequence-stream mechanism.

It may be that I am unaware of some more complex common encodings, if
there are any that you are thinking of in specific, please let me
know.

Regards,

Brian
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226532389569746@naggum.net>
* Brian Spilsbury
| A string cannot use non-vector substrate in CL, if it were
| fundamentally a sequence, they it could, as long as that substrate
| satisfied sequence.

  As I said, we have a terminological problem here.  vector and list are
  disjoint subclasses of sequence.  string is a subclass of vector.

| from memory vectors are not necessarily O(1) random access in CL,

  This might be at the core of your confusion.

| When I say sequence, I mean the type-definition, rather than a particular
| data-type.

  I know Common Lisp too well to understand what you mean.

| Lists have support for random access implemented via sequential
| accessors.  Vectors have support for linear access implemented via random
| accessors.

  No, this is really fundamentally confused.  Random access _means_ O(1).
  Linear access means that you have a first-class pointer to each element,
  required to access the next.  Both the cons cell and the stream satisfy
  the latter.

| The real problem is that sequence doesn't define any iterative operators,
| only cons [as list] does via cdr/rest and dolist, and the ad-hoc support
| via loop.

  What is "ad-hoc" about it?  This is very puzzling.

| I do not think that limiting yourself to a single mark/point pair, nor
| keeping a mark/point in the container, where any modification propagates
| side-effects, is a particularly good strategy for lisp.

  I think you should read what I write a little better.  It is vital that
  mark and point are _not_ part of the string, but of the iterator.  I have
  said as much.  Please do not rudely ask me to waste my time to refute
  conclusions based on things I have not said.

| I think it is relatively straightforward, in some encodings the amount
| of state might be annoyingly large, though.

  Well, we just appear to have different tolerance of necessities, or you
  know some encodings I do not, which I kind of doubt.  An example of a
  stateful encoding with an annoyingly large amount of state would be
  useful so I know where the amount becomes annoyingly large.

| In the standard compression scheme for unicode you need to save
| Single-Byte-Mode-P, Current-Window, and the 8 Dynamic-Window-Offsets, and
| Locking-Shift-P, I've only glanced over the spec, so please excuse
| omission or error.

  Seems pretty accurate.

| The Unicode SCS is pretty heavy on state, I'll agree, that's 11 words
| in the most conservative form, although there are various
| optimisations you could apply, I might expect to represent that in 5
| 32-bit words with packing.

  This is so heavy on state you want to optimize the storage?  My good man,
  this is nothing and not worth optimizing.

| The other advantage is that we don't need to store the state in the
| string at all, the transitory state is kept in the iterator (ie,
| dosequence, map, subseq, etc), and this means that we can share the
| string freely between readers, as we currently expect to be able to.

  I am really curious now.  You _always_ store the state in the object that
  modifies it, _never_ in the object it refers to.  A peculiar C++ disease
  which I had the good fortune of discussing with a project leader who just
  had to vent his frustration with some of his programmers and their sheer
  inability to write threadsafe code precisely because they were hell-bent
  on "optimizing" data storage and stored the state of an iterator in the
  object iterated over.  I wondered how anyone could even think of such an
  obviously boneheaded thing, but these people, he told me, were so deeply
  concerned with not using dynamic memory and conserving memory in general
  that they made this idiotic coding practice a matter of _pride_ and would
  therefore not consider changing it, even when ordered to fix the problem.
  Thread safety or, more generally, the ability to have multiple references
  to the same object, is the Lisp way, and being anal about memory usage is
  not the Lisp way.

| I think that a lot of state is the exception rather than the rule.

  You are actually wrong about this.  The ideal of statelessness is
  generally a very bad idea, as it tries to hide state under the rug.
  Generally, state can be layered, and this is good, but it is therefore
  extremely important to layer it correctly.  I mean, I thought this would
  be exceptionally obvious when we have a string-stream concept that can
  iterate over a string with stream operators, but you have to be explicit
  about setting up these iterators.  (It should have been more general,
  so one could iterate over the elements of a vector with read-byte.)

| I also think that as shown above, we can externalise that state into
| points, at an acceptable cost for reasonable encodings.

  I truly wonder how you could have thought that anyone would want to store
  the iteration state in the object iterated over.  That is such a classic
  mistake that I am annoyed that I have to argue against it.

| It may be that I am unaware of some more complex common encodings, if
| there are any that you are thinking of in specific, please let me know.

  Try implementing a full ISO 2022 processor, try representing the device
  that ISO 6429 (informally known as "ANSI escape sequences") writes to, or
  consider the amount of state in a fully fledged MIME processor.  Side-
  effects and modifying state is a good thing, but it must, of course, be
  localized with the functions that maintain the state, not with the
  object that is being referenced incidentally.  Or maybe this is just that
  annoyingly stupid Object Oriented Programming thing, again, where the
  object itself is supposed to know something about how it is used.  This
  is just plain bad design.  Stuffing "next" pointers into a structure to
  build a linked list is equally nuts, but many believe this is good and
  cannot fathom the point of using a vector or a linked list that points to
  the objects in question.  Such people should be kept away from computers.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0203310009.564d69bb@posting.google.com>
Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> * Brian Spilsbury
> | from memory vectors are not necessarily O(1) random access in CL,
> 
>   This might be at the core of your confusion.

It's possible, but you have provided no reasoning or references.

"System Class ARRAY:

An array contains objects arranged according to a Cartesian coordinate
system. An array provides mappings from a set of fixnums
{i0, i1, ..., ir-1} to corresponding elements of the array, where
0 <= ij < dj, r is the rank of the array, and dj is the size of
dimension j of the array."

Vectors are defined in terms of arrays.

The definition of an array is such that you could implement an array
via a hash-bucket which accepted only integers in the specified range.

> | Lists have support for random access implemented via sequential
> | accessors.  Vectors have support for linear access implemented via random
> | accessors.
> 
>   []  Random access _means_ O(1). []

No, random access means that the interface allows access to elements
in a random order.

This does not necessarily imply an O(1) access characteristic,
although this might be commonly expected.

As an example:
 * Does a hash-bucket object provide a random-access accessor?
 * Is it O(1) to access?
 * Does the degenerate case of a hash-bucket containing only one
   bucket, implemented with a list, give O(n) access?

> | The real problem is that sequence doesn't define any iterative operators,
> | only cons [as list] does via cdr/rest and dolist, and the ad-hoc support
> | via loop.
> 
>   What is "ad-hoc" about it? []

What is ad-hoc is that loop is a nice baroque flow-control language
which happens to have some support for iterating sequences in certain
circumstances.

Loop is not an iteration primitive for sequences, and CL does not
contain such a primitive to my knowledge.
 
> | I do not think that limiting yourself to a single mark/point pair, nor
> | keeping a mark/point in the container, where any modification propagates
> | side-effects, is a particularly good strategy for lisp.
> 
>   [] It is vital that mark and point are _not_ part of the string, but of the iterator.  []

I'm glad that you agree.
 
> | I think it is relatively straightforward, in some encodings the amount
> | of state might be annoyingly large, though.
> 
>   Well, we just appear to have different tolerance of necessities, or you
>   know some encodings I do not, which I kind of doubt.  An example of a
>   stateful encoding with an annoyingly large amount of state would be
>   useful so I know where the amount becomes annoyingly large.

This depends on how easily annoyed you are. The example of the SCS
encoding is one that I would consider to have a relatively large
amount of state carried between elements.

> | The Unicode SCS is pretty heavy on state, I'll agree, that's 11 words
> | in the most conservative form, although there are various
> | optimisations you could apply, I might expect to represent that in 5
> | 32-bit words with packing.
> 
>   This is so heavy on state you want to optimize the storage? []

I did not say that it was necessary or desirable, merely possible.

I can imagine some cases in which it would be desirable to sacrifice
speed for reduced consing, although they would be unusual.
 
> | The other advantage is that we don't need to store the state in the
> | string at all, the transitory state is kept in the iterator (ie,
> | dosequence, map, subseq, etc), and this means that we can share the
> | string freely between readers, as we currently expect to be able to.
> 
>   I am really curious now.  You _always_ store the state in the object that
>   modifies it, _never_ in the object it refers to. []

Yes, that is what I'm advocating.

> | I think that a lot of state is the exception rather than the rule.
> 
>   You are actually wrong about this.  []

I may be wrong about this, but you would need to provide statistics to
demonstrate that a lot of state is the rule rather than the exception.

> | I also think that as shown above, we can externalise that state into
> | points, at an acceptable cost for reasonable encodings.
> 
>   I truly wonder how you could have thought that anyone would want to store
>   the iteration state in the object iterated over. []

Probably because of a reference to Emacs and mark/point.

> | It may be that I am unaware of some more complex common encodings, if
> | there are any that you are thinking of in specific, please let me know.
> 
>   Try implementing a full ISO 2022 processor, try representing the device
>   that ISO 6429 (informally known as "ANSI escape sequences") writes to, or
>   consider the amount of state in a fully fledged MIME processor. []

From a quick glance ISO-2022 doesn't seem enormously different to the
Unicode SCS, set-selection, lock-shift, character-escaping, etc.
Unfortunately the specification doesn't appear to be available on-line.
If you have a reference to such, please provide it.

I'm not sure how display control sequences and MIME processing relate
to string encoding.

Regards,

Brian
From: ozan s. yigit
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <4da3d9af.0204010739.38220d77@posting.google.com>
·····@designix.com.au (Brian Spilsbury):

> No, random access means that the interface allows access to elements
> in a random order.

the term originally meant "uniform/unit-cost access" for any element in
any order. vectors have this property. hash tables in general do not.

oz
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226648279194697@naggum.net>
* Brian Spilsbury
| It's possible, but you have provided no reasoning or references.

  I generally do not consider it my job to unconfuse people who make claims
  that something untrue is true.  In fact, I take part in a discussion with
  the premise that those I talk to have done their own homework.  If they
  have not and are not inclined to do it upon request, there can be no
  discussion.

| The definition of an array is such that you could implement an array
| via a hash-bucket which accepted only integers in the specified range.

  
> Random access _means_ O(1).

| No, random access means that the interface allows access to elements in a
| random order.

  OK, so our terminology problem has just been compounded with
  stubbornness.

| This does not necessarily imply an O(1) access characteristic, although
| this might be commonly expected.

  If an implementation offers arrays that have anything other than O(1)
  access characteristics, it will be so resoundingly trashed that even
  inventing such silly interpretations indicates that you come here to
  quibble, not understand anything.

| What is ad-hoc is that loop is a nice baroque flow-control language which
| happens to have some support for iterating sequences in certain
| circumstances.

  (incf *troll-indicator*)

> Well, we just appear to have different tolerance of necessities

| This depends on how easily annoyed you are.

  Really?

> An example of a stateful encoding with an annoyingly large amount of
> state would be useful so I know where the amount becomes annoyingly
> large.

| The example of the SCS encoding is one that I would consider to have a
| relatively large amount of state carried between elements.

  SCS is nice and small by all standards.

| I can imagine some cases in which it would be desirable to sacrifice
| speed for reduced consing, although they would be unusual.

  Huh?  Why would anyone sacrifice speed for reduced consing?  Are you sure
  you know what you are talking about here?  Do you think using more memory
  leads to _slower_ code?  It is usually the opposite that is true.

| Yes, that is what I'm advocating.

  So you are just agreeing with me by arguing against what I suggest?

| > | I think that a lot of state is the exception rather than the rule.
| > 
| >   You are actually wrong about this.  []
| 
| I may be wrong about this, but you would need to provide statistics to
| demonstrate that a lot of state is the rule rather than the exception.

  How about you cough up some statistics to support your own claim!?

  (incf *troll-indicator*)

| > | I also think that as shown above, we can externalise that state into
| > | points, at an acceptable cost for reasonable encodings.
| > 
| >   I truly wonder how you could have thought that anyone would want to
| >   store the iteration state in the object iterated over. []
| 
| Probably because of a reference to Emacs and mark/point.

  OK, I see that this simile/analogy/metaphor thing is too complex for
  communication with you.  I shall adjust accordingly.

| I'm not sure how display control sequences and MIME processing relate
| to string encoding.

  Just think about it.  This kind of statefulness is also found in input
  editing, which may occur at different times.

  But I think you are a literate troll, and will probably not respond if
  you do not do any work on your own and only demand work of others when
  they doubt your statements.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0204010810.2a2f986@posting.google.com>
Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> * Brian Spilsbury


> > Random access _means_ O(1).
>  
> | No, random access means that the interface allows access to elements in a
> | random order.
> 
>   OK, so our terminology problem has just been compounded with
>   stubbornness.

> | This does not necessarily imply an O(1) access characteristic, although
> | this might be commonly expected.
> 
>   If an implementation offers arrays that have anything other than O(1)
>   access characteristics, it will be so resoundingly trashed that even
>   inventing such silly interpretations indicates that you come here to
>   quibble, not understand anything.

I'm glad that you've recanted your position about random access
requiring O(1) access characteristics.

This is not a silly interpretation nor a quibble, it is essential to
the understanding of data-type interfaces and performance
characteristics.

Beyond which you have taken an aside note, and blown it out of all
proportion.

Accept that you made an incorrect assertion and move on.

The point that was being raised was that the spirit of the CL
definition of string in terms of vector severely hampers any
variable-width encoding.

The aside point was that given the letter of the CL definition of array
and therefore vector, you could actually implement such a variable
width encoding as a vector type and remain compliant.

Does this clarify the situation?

> | What is ad-hoc is that loop is a nice baroque flow-control language which
> | happens to have some support for iterating sequences in certain
> | circumstances.
> 
>   (incf *troll-indicator*)

Do you engage in personal attack in lieu of actual reasoning?

Can you provide meaningful disagreement with that assessment of loop?

> > An example of a stateful encoding with an annoyingly large amount of
> > state would be useful so I know where the amount becomes annoyingly
> > large.
>  
> | The example of the SCS encoding is one that I would consider to have a
> | relatively large amount of state carried between elements.
> 
>   SCS is nice and small by all standards.

Give an example which is average by your standards.

> | I can imagine some cases in which it would be desirable to sacrifice
> | speed for reduced consing, although they would be unusual.
> 
>   Huh?  Why would anyone sacrifice speed for reduced consing?  Are you sure
>   you know what you are talking about here?  Do you think using more memory
>   leads to _slower_ code?  It is usually the opposite that is true.

Someone might be concerned with latency spikes from a non-real-time
garbage-collector.

Again, this would be unusual. (As a side note, if one thing is usually
true, then it being false in an unusual situation is not in any way
conflicting.)

> | Yes, that is what I'm advocating.
> 
>   So you are just agreeing with me by arguing against what I suggest?

No. You misunderstood what I was saying.

> | > | I think that a lot of state is the exception rather than the rule.
> | > 
> | >   You are actually wrong about this.  []
> | 
> | I may be wrong about this, but you would need to provide statistics to
> | demonstrate that a lot of state is the rule rather than the exception.
> 
>   How about you cough up some statistics to support your own claim!?
> 
>   (incf *troll-indicator*)

Firstly I offered an opinion.

Secondly you rebutted this in harsh terms without any relevant
information supplied.

Thirdly you engaged in personal attacks when asked for justification
for your unsupported rebuttal.

Perhaps you need to re-think what trolling means.

Further, all of the examples that I showed have quite small amounts
of contextual state: utf-8, shift-jis, euc-jp, euc-kr.  The one with
the most state is SCS.  ISO 2022 doesn't look much heavier than SCS;
however, I do not have access to the ISO 2022 specification.

You have failed to provide any reference to any character-stream
protocol which is heavier in such state. MIME and terminal control
sequences do not qualify.

Please do so, and do not make empty complaints about being forced to
do homework. This is called 'backing up your own argument'.

> | > | I also think that as shown above, we can externalise that state into
> | > | points, at an acceptable cost for reasonable encodings.
> | > 
> | >   I truly wonder how you could have thought that anyone would want to
> | >   store the iteration state in the object iterated over. []
> | 
> | Probably because of a reference to Emacs and mark/point.
> 
>   OK, I see that this simile/analogy/metaphor thing is too complex for
>   communication with you.  I shall adjust accordingly.

Try to avoid personal attack if you want to be taken seriously.

> | I'm not sure how display control sequences and MIME processing relate
> | to string encoding.
> 
>   Just think about it.  This kind of statefulness is also found in input
>   editing, which may occur at different times.

Input editing deals largely with intermediate state, as opposed to
contextual state, and is not within the domain of the problem of
string representation and accessing.

If you mean something else, then please clarify, without personal
attacks.

>   But I think you are a literate troll, and will probably not respond if
>   you do not do any work on your own and only demand work of others when
>   they doubt your statements

When statements disagree, the onus falls upon the person making the
stronger claim (for example 'You are actually wrong about this.', in
contrast with 'I think that a lot of state is the exception rather
than the rule.', which is a far weaker claim).

Secondly, what work have you done here apart from making demands of me
when you disagree? Avoid hypocritical positions.

Please also avoid engaging in personal attack.

It is no substitute for reasoned discussion.

At this point it does not appear likely that it will be profitable to
continue.

Regards,

Brian
From: Erik Naggum
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <3226667302683383@naggum.net>
* Brian Spilsbury
| I'm glad that you've recanted your position about random access
| requiring O(1) access characteristics.

  Troll.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas F. Burdick
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <xcvu1qvns29.fsf@conquest.OCF.Berkeley.EDU>
·····@designix.com.au (Brian Spilsbury) writes:

> Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> >   []  Random access _means_ O(1). []
> 
> No, random access means that the interface allows access to elements
> in a random order.
> 
> This does not necessarily imply an O(1) access characteristic,
> although this might be commonly expected.

I cannot think of any way of having a random-access data structure
where lookups weren't O(1).  If you have some exceptional data
structure in mind, please say what it is, because no one else has
heard of it.

> As an example:
>  * Does a hash-bucket object provide a random-access accessor?

Yes, probably.

>  * Is it O(1) to access?

To the extent that it provides random access, yes.  In really
degenerate cases, hash tables can only provide linear access, which
means they're O(n), but in that case, they're not random access; but
then, you probably knew this.

>  * Does the degenerate case of a hash-bucket containing only one
> bucket implemented with a list give O(n) access?

Of course.  But it does not give random access, just a crappy
interface to a list.
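
For concreteness, the degenerate case being discussed is essentially an
alist behind a hash-table-style interface (a toy sketch, not how any
real implementation is written):

;; A "hash table" with a single bucket is just an alist: every lookup
;; walks the whole bucket, so access is O(n) despite the keyed interface.
(defvar *one-bucket* '())

(defun one-bucket-get (key)
  (cdr (assoc key *one-bucket* :test #'equal)))

(defun one-bucket-put (key value)
  (push (cons key value) *one-bucket*)
  value)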

> > | The real problem is that sequence doesn't define any iterative operators,
> > | only cons [as list] does via cdr/rest and dolist, and the ad-hoc support
> > | via loop.
> > 
> >   What is "ad-hoc" about it? []
> 
> What is ad-hoc is that loop is a nice baroque flow-control language
> which happens to have some support for iterating sequences in certain
> circumstances.

True, but the support for sequences in LOOP is not ad-hoc, it's nicely
integrated into the rest of LOOP.

> Loop is not an iteration primitive for sequences, and CL does not
> contain such a primitive to my knowledge.

Sure it does, MAP.  IMHO, CL could have used a DOSEQUENCE to go along
with MAP, but CL certainly gives you a general sequence iteration
facility.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'                               
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0204012102.6e76e5a4@posting.google.com>
···@conquest.OCF.Berkeley.EDU (Thomas F. Burdick) wrote in message news:<···············@conquest.OCF.Berkeley.EDU>...
> ·····@designix.com.au (Brian Spilsbury) writes:
> 
> > Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
> > >   []  Random access _means_ O(1). []
> > 
> > No, random access means that the interface allows access to elements
> > in a random order.
> > 
> > This does not necessarily imply an O(1) access characteristic,
> > although this might be commonly expected.
> 
> I cannot think of any way of having a random-access data structure
> where lookups weren't O(1).  If you have some exceptional data
> structure in mind, please say what it is, because no one else has
> heard of it.
> 
> > As an example:
> >  * Does a hash-bucket object provide a random-access accessor?
> 
> Yes, probably.
> 
> >  * Is it O(1) to access?
> 
> To the extent that it provides random access, yes.  In really
> degenerate cases, hash tables can only provide linear access, which
> means they're O(n), but in that case, they're not random access; but
> then, you probably knew this.
> 
> >  * Does the degenerate case of a hash-bucket containing only one
> > bucket implemented with a list give O(n) access?
> 
> Of course.  But it does not give random access, just a crappy
> interface to a list.

Well, this is a consistent position to take.

However there are some implications which might not be obvious.

If we define random-access to be uniform time access, then the
addition of a cache mechanism to an otherwise random-access structure
causes it to stop being random-access (or at least become less
random-access).

Beyond this, it begs the question 'why is random-access called
random-access rather than uniform-time access?'

My understanding is that it is random-access in the sense that random
elements are necessarily unrelated, and therefore random accesses are
likewise independent of one another, but may well be dependent upon
their own individual differences.

I think that it makes little sense to tie independent element access
back to uniform access time.

As an example, is your random-access memory random-access if we have
added a cache to it? By the definition which you have given, we would
at least have to say that it is 'less random-access' than uncached RAM
would be.

This does not seem particularly reasonable.

As a second example: is a hard-drive random-access? The underlying
implementation certainly is not. The interface that we use to a
hard-drive tends to be.

This is a more interesting example, since the implementation's access
characteristics for different elements are not independent, but we
ignore this factor in the higher-level interface, i.e. we deal with the
sequential-access implementation of the hard-drive through an
abstraction which provides a random-access interface.

My feeling is that for a consistent view of random-access we need to
consider whether access to a given element is dependent upon access to
another element at the level of the interface that is exposed.

This means that I need to accept a hash-bucket structure as
random-access, but I can still talk about lousy degenerate
performance.

As a final note, you've ended up with a hash-bucket's random-access
nature being undefined.

> > > | The real problem is that sequence doesn't define any iterative operators,
> > > | only cons [as list] does via cdr/rest and dolist, and the ad-hoc support
> > > | via loop.
> > > 
> > >   What is "ad-hoc" about it? []
> > 
> > What is ad-hoc is that loop is a nice baroque flow-control language
> > which happens to have some support for iterating sequences in certain
> > circumstances.
> 
> True, but the support for sequences in LOOP is not ad-hoc, it's nicely
> integrated into the rest of LOOP.

Yes, but not into the rest of CL :)

I'm not saying that loop is a bad thing, which is why I said 'nice'.

> > Loop is not an iteration primitive for sequences, and CL does not
> > contain such a primitive to my knowledge.
> 
> Sure it does, MAP.  IMHO, CL could have used a DOSEQUENCE to go along
> with MAP, but CL certainly gives you a general sequence iteration
> facility.

Map and the associated functions do iterate across sequences.

There are two things that are lacking in this regard though, imho.

One is an ability to iterate a subsequence.

The other is the ability to provide access to the sequence being
iterated from the current position.

As an example, consider using map to implement a LALR(1) parser.

We can have no look-ahead at all, so we must look backward, which we
can do.

We cannot know when we're about to terminate (unless we track our
position and the length manually).

We could implement a string parser like:

(let ((last nil) (state (make-state)))
  (map nil (lambda (char)
             (build-state state last char)
             (setf last char))
       buffer)
  ;; handle the last element
  (build-state state last (elt buffer (- (length buffer) 1)))
  state)

I do not think that it is reasonable to view map as being a general
iteration mechanism.

I think that map is quite sufficient as a mapping mechanism without
trying to shoehorn things like this in. :)

Regards,

Brian
From: Thomas F. Burdick
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <xcvbsd2h1ku.fsf@famine.OCF.Berkeley.EDU>
·····@designix.com.au (Brian Spilsbury) writes:

> Well, this is a consistent position to take.
> 
> However there are some implications which might not be obvious.
> 
> If we define random-access to be uniform time access, then the
> addition of a cache mechanism to an otherwise random-access structure
> causes it to stop being random-access (or at least become less
> random-access).
> 
> Beyond this, it begs the question 'why is random-access called
> random-access rather than uniform-time access?'
> 
> My understanding is that it is random-access in that sense that random
> elements are necessarily unrelated, and therefore random-accesses are
> likewise independent of one another, but may well be dependent upon
> their own individual differences.
> 
> I think that it makes little sense to tie independent element access
> back to uniform access time.

I think that constant-time, or at least approximately constant-time,
access is a part of random access.  Otherwise the term is meaningless
-- after all, any item in a list can be accessed independently of the
others:

  (nth x some-list)

Is a list a random-access data structure?  No, because access must
occur linearly, even if it isn't obvious from the interface.  This is
I think the source of the apparent confusion: the interface doesn't
define the data structure.  You might say that NTH provides a
random-access interface to lists, but lists certainly aren't
random-access data structures.

> As an example, is your random-access memory random-access if we have
> added a cache to it? By the definition which you have given, we would
> at least have to say that it is 'less random-access' than uncached RAM
> would be.
> 
> This does not seem particularly reasonable.

[ It's too late for me to actually think about this.  Rather than
  dismiss your point, I'll come back to it tomorrow.  My gut instinct
  is to say that hardware is different than data structures. ]

> As a second example; Is a hard-drive random-access? The underlying
> implementation certainly is not. The interface that we use to a
> hard-drive tends to be.
> 
> This is a more interesting example, since the implementation's access
> characteristics for different elements are not independent, but we
> ignore this factor in the higher level interface, ie we deal with the
> sequential access implementation of the hard-drive though an
> abstraction which provides a random access interface.

Yes, exactly.

> My feeling is that for a consistent view of random-access we need to
> consider whether access to a given element is dependent upon access to
> another element at the level of the interface that is exposed.

I think that's a good way of talking about random-access interfaces.
As for the data structure itself, I'd only classify it as
random-access if access to a given element is independent of access to
other elements at the level of operating directly on the data
structure.  This of course implies O(1) access.

> This means that I need to accept a hash-bucket structure as
> random-access, but I can still talk about lousy degenerate
> performance.

The interface is certainly random-access.

> As a final note you've ended up with a hash-bucket's random-access
> nature being undefined.

True.  However, for particular strategies of implementing hash tables,
it can be well defined.  For an implementation that uses a single
bucket, and is thus effectively an alist, it is no doubt linear, not
random, access.  For a hash on a specific data type that was able to
guarantee no hash-bucket collisions, no doubt it would be random
access because the process of looking up an entry would not depend on
finding any other entries.

> > > Loop is not an iteration primitive for sequences, and CL does not
> > > contain such a primitive to my knowledge.
> > 
> > Sure it does, MAP.  IMHO, CL could have used a DOSEQUENCE to go along
> > with MAP, but CL certainly gives you a general sequence iteration
> > facility.

Oops, I said "general" and you didn't.  To your original statement of
"CL does not contain [an iteration primitive for sequences]," I say:
MAP.  But you're right, it's not a general iteration facility for
sequences.

> Map and the associated functions do iterate across sequences.
> 
> There are two things that are lacking in this regard though, imho.
> 
> One is an ability to iterate a subsequence.

True.  And unfortunately the only way to add this is to write the
whole thing :(

> The other is the ability to provide access to the sequence being
> iterated from the current position.

If I understand you correctly, I'm not sure I agree that this is
necessary for an iteration facility to be general.  What I think you
mean is the ability to do something like:

  (sequence-iterator (char ("hi dee ho" :start 1 :end 5) current-position)
    (when (char= char #\Space)
      (return current-position)))

  #<The POSITION 2 in the vector "hi dee ho" from 1 to 5>

  (sequence-iterator (char *)
    (princ char))

  dee

That would be a nice language feature, though.  You could obviously
get yourself into trouble with (SETF CDR) of a list you were iterating
over, but that's always an issue.

> As an example, consider using map to implement a LALR(1) parser.
> 
> We can have no look-ahead at all, so we must look backward, which we
> can do.
> 
> We cannot know when we're about to terminate (unless we track our
> position and the length manually).
> 
> We could implement a string parser like:
> 
> (let ((last nil) (state (make-state)))
>   (map nil (lambda (char)
>              (build-state state last char)
>              (setf last char))
>        buffer)
>   ;; handle the last element
>   (build-state state last (elt buffer (- (length buffer) 1)))
>   state)
> 

I'm not quite sure what you're trying to get at with your example,
though.  Your algorithm is to build the state based on the last state,
the last char, and the current char, then at the end build the final
state from the last state, and the last char as both the current char
and the last char.  That's kind of weird, and the last step is
definitely different from the previous steps.  How would you imagine
that fitting into a single iteration construct?  In LOOP or in an
Algol- or Pascal-like language you'd use something like FINALLY.
Personally, I think it's weird to have the iterator keep track of the
last char instead of having the state object do that.  I'd have
written the code:

  (let ((state (make-state)))
    (map nil (lambda (char) (build-state state char))
         buffer)
    (finalize-state state))

or, more likely:

  (use-package #:tfb-utilities)
  (let ((state (make-state)))
    (dosequence (char buffer) (build-state state char))
    (finalize-state state))

That seems like a more natural way to cut the problem, and I don't
feel like it's just working around insufficient iteration facilities.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'                               
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0204020621.26694e96@posting.google.com>
···@famine.OCF.Berkeley.EDU (Thomas F. Burdick) wrote in message news:<···············@famine.OCF.Berkeley.EDU>...
> ·····@designix.com.au (Brian Spilsbury) writes:
> 
> I think that constant-time, or at least approximately constant-time,
> access is a part of random access.  Otherwise the term is meaningless
> -- after all, any item in a list can be accessed independently of the
> others:
> 
>   (nth x some-list)
>
> Is a list a random-access data structure?  No, because access must
> occur linearly, even if it isn't obvious from the interface.  This is
> I think the source of the apparent confusion : the interface doesn't
> define the data structure.  You might say that NTH provides a
> random-access interface to lists, but lists certainly aren't
> random-access data structures.

I think that you cannot consider a data-structure to be random-access
or sequential-access, but I think that you can consider various
operators to be one or the other.

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?random-access+memory

sums it up nicely, imho

"...for which the order of access to different locations does not
affect the speed of access"

Which is not to say that it might not have a cost up to O(n) to access
a particular location for a particular implementation.

I would have to say that NTH is a random access operator, with O(n)
access characteristics when implemented over car/cdr.

[as a counterpoint, consider NTH implemented for a cdr-coded list :),
it remains random-access, but with O(1) access characteristics]
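
For reference, here is NTH written directly over CAR/CDR (a sketch,
ignoring error checking): the interface is index-addressed, i.e.
random-access in the sense argued above, while the cost is O(n).

(defun my-nth (n list)
  (if (zerop n)
      (car list)
      (my-nth (1- n) (cdr list))))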

>
> > As an example, is your random-access memory random-access if we have
> > added a cache to it? By the definition which you have given, we would
> > at least have to say that it is 'less random-access' than uncached RAM
> > would be.
> > 
> > This does not seem particularly reasonable.
> 
> [ It's too late for me to actually think about this.  Rather than
>   dismiss your point, I'll come back to it tomorrow.  My gut instinct
>   is to say that hardware is different than data structures. ]

Fair enough, I'll just ask what the difference between a
data-structure definition and the specification of a machine is.

> > Map and the associated functions do iterate across sequences.
> > 
> > There are two things that are lacking in this regard though, imho.
> > 
> > One is an ability to iterate a subsequence.
> 
> True.  And unfortunately the only way to add this is to write the
> whole thing :(
> 
> > The other is the ability to provide access to the sequence being
> > iterated from the current position.
> 
> If I understand you correctly, I'm not sure I agree that this is
> necessary for an iteration facility to be general.  What I think you
> mean is the ability to do something like:
> 
>   (sequence-iterator (char ("hi dee ho" :start 1 :end 5) current-position)
>     (when (char= char #\Space)
>       (return current-position)))
> 
>   #<The POSITION 2 in the vector "hi dee ho" from 1 to 5>
> 
>   (sequence-iterator (char *)
>     (princ char))
> 
>   dee
> 
> That would be a nice language feature, though.  You could obviously
> get yourself into trouble with (SETF CDR) of a list you were iterating
> over, but that's always an issue.

Well, sequence doesn't support cdr, so that would be a problem, but
yes, this looks like the interface I was arguing for with Erik
earlier.

The idea of iterating with points which could be saved in order to
restart/continue an iteration over an arbitrary sequence.

Then you could, for example, iterate over a string, remember where the
start and finish of the words were (as points into the sequence).

And then (subseq sequence start end)

In order to do this, a 'point object' would need to store the index it
referred to so that you could mix points and indices.

I suspect we're in agreement from the example above.

[as a caveat the scope in which points might be safely used might need
to be carefully delineated]
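
A rough sketch of that usage, with plain indices standing in for
points (WORD-POINTS is hypothetical):

(defun word-points (string)
  "Collect the words of STRING as (start . end) index pairs."
  (let ((points '()) (start nil))
    (dotimes (i (length string))
      (if (char= (char string i) #\Space)
          (when start
            (push (cons start i) points)
            (setf start nil))
          (unless start
            (setf start i))))
    (when start
      (push (cons start (length string)) points))
    (nreverse points)))

;; (mapcar (lambda (p) (subseq "hi dee ho" (car p) (cdr p)))
;;         (word-points "hi dee ho"))
;; => ("hi" "dee" "ho")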

> I'm not quite sure what you're trying to get at with your example,

Well, it was just an example to show that while you could use map to
implement a LALR(1) parser, it wouldn't be a sensible approach.

Build-state was poorly named.

Regards,

Brian
From: Nate Holloway
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <189890ca.0204030818.348aab6@posting.google.com>
·····@designix.com.au (Brian Spilsbury) wrote:
<In a previous article:>
> If we define random-access to be uniform time access, then the
> addition of a cache mechanism to an otherwise random-access structure
> causes it to stop being random-access (or at least become less
> random-access).
>
> Beyond this, it begs the question 'why is random-access called
> random-access rather than uniform-time access?'

There is a big difference between a uniform-time process and an O(1) process.
Consider the computation of some function f(N) which takes (mod N x) seconds,
where x << N. This function does not have O(N), nor is it "uniform-time".
Variation in latency due to factors uncorrelated to the magnitude of the argu-
ment are not what Order of complexity is designed to measure.
Additionally, a cache is not like a hash table - there is no recourse to a
sequential search. If a cache misses the access is punted to a deeper level
of machinery with a larger, yet equally constant-bounded latency. As the storage
hierarchy is finite the time complexity of access is always O(1).

<In his most recent article:>
> ···@famine.OCF.Berkeley.EDU (Thomas F. Burdick) wrote in message news:<···············@famine.OCF.Berkeley.EDU>...
> > I think that constant-time, or at least approximately constant-time,
> > access is a part of random access.  Otherwise the term is meaningless
> > -- after all, any item in a list can be accessed independently of the
> > others:
> > 
> >   (nth x some-list)
> >
> > Is a list a random-access data structure?  No, because access must
> > occur linearly, even if it isn't obvious from the interface.  This is
> > I think the source of the apparent confusion : the interface doesn't
> > define the data structure.  You might say that NTH provides a
> > random-access interface to lists, but lists certainly aren't
> > random-access data structures.
> I think that you cannot consider a data-structure to be random-access
> or sequential-access, but I think that you can consider various
> operators to be one or the other.

You said that you don't think that saying an operator is "random-access" implies
its performance is bounded by O(1). You seem to want to view data structures as
patterns of bits without evidence of intent. I would suggest that a data struc-
ture is an artifact that embodies an intent and as such can be defensibly said
to have an efficiency or a complexity for its intended purpose.

> http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?random-access+memory
>
> sums it up nicely, imho
>
> "...for which the order of access to different locations does not
> affect the speed of access"

I do not think the definition from FOLDOC is very good, because I believe a
hard disk is a random-access device. (What is the category with which random
access is pairwise disjoint? Is it not sequential access? There are good reasons
not to call a hard disk a sequential-access device. Even the cyclic tracks that
compose the disk need not be sequentially accessed; think interleaving.)

I suppose it could be argued that the alternative to random access is any
kind of ordered access, not just linearly sequential. But for this category to
have a useful weight of meaning, the orderedness would have to be enforced;
clearly ordered access is allowed as a type of random access. What would it then
mean to say a storage device (like a disk) had ordered access characteristics?
To say such access patterns were enforced would mean there were some number M
of sets composing the disk, each one of which provided addressable elements in
a fixed order, such that an element N+1 always became accessible after an ele-
ment N. In this kind of arrangement, we still could not say that the complexity
of access was O(N) if M is large compared to N.

By the way, I don't think this story is particularly realistic. The latencies
associated with disk access have very little correlation with the magnitude of
the address, since the platter surface doesn't start in the same position
prior to each access.

> Which is not to say that it might not have a cost up to O(n) to access
> a particular location for a particular implementation.

You have selectively quoted FOLDOC to manufacture the impression that it agrees
with your tendentious argument. In fact, the same definition later claims:
    "Interestingly, some {DRAM} devices are not truly random access because var-
    ious kinds of "{page mode}" or "column mode" mean that sequential access is
    faster than random access."
So what FOLDOC actually says is that any small-scale deviation from uniform lat-
ency makes it "not truly random access". You are saying that an operation is
random-access whenever its interface addresses its parts using absolute
indices, regardless of latency.

Let us look at the different perspectives here to see how they differ.

* This definition from FOLDOC says that "random-access" applies
  only to structures where each access completes in the same amount of time.
* You assert that "random-access" has essentially no meaning when applied to a
  data structure, only when applied to the interface of some operation, and it
  says nothing about the order of complexity of the operation.

There is a midpoint between these views (very restrictive, on one hand, and
meaningless, on the other).

* A random-access process must complete in O(1) time. It can take any amount
  of time for each access as long as the upper bound on these times is not
  affected by the magnitude of the address. A random-access structure is one
  which allows this process.

> I would have to say that NTH is a random access operator, with O(n)
> access characteristics when implemented over car/cdr.
>
> [as a counterpoint, consider NTH implemented for a cdr-coded list :),
> it remains random-access, but with O(1) access characteristics]

Certainly not. Even if we assume that each Q of the list has cdr-next except for
the last Q having cdr-nil, NTH is required to cdr down sequentially, since it
has no other way of telling when it has reached the end. Cdr-coding of lists is
also not either/or. The list on which NTH is called may have been produced by
LIST*ing a cdr-coded list onto a normally-coded list, and the only way for NTH
to be correct in the face of this possibility is to examine each Q.

> > > As an example, is your random-access memory random-access if we have
> > > added a cache to it? By the definition which you have given, we would
> > > at least have to say that it is 'less random-access' than uncached RAM
> > > would be.
> > This does not seem particularly reasonable.
> > [ It's too late for me to actually think about this.  Rather than
> >   dismiss your point, I'll come back to it tomorrow.  My gut instinct
> >   is to say that hardware is different than data structures. ]
> Fair enough, I'll just ask what the difference between a
> data-structure definition and the specification of a machine is.
> > > > Map and the associated functions do iterate across sequences.
> > > There are two things that are lacking in this regard though, imho.
> > > One is an ability to iterate a subsequence.
> > True.  And unfortunately the only way to add this is to write the
> > whole thing :(

(defmacro dosequence ((element sequence &key start end) &body body)
   ;; When :start/:end are given, build a "ruler": an adjustable vector
   ;; that is NIL below START and T from START up to END.  Mapping over
   ;; SEQUENCE and the ruler together stops at END (the shorter sequence),
   ;; and the body runs only where the ruler holds T.
   (let ((ruler (gensym "DOSEQUENCE-RULER"))
         (mark  (gensym "DOSEQUENCE-MARK"))
         (subseqp (or start end)))
     `(block nil
         (let (,@(when subseqp
                   `((,ruler (make-array (list ,(or start 0))
                                         :initial-element nil
                                         :adjustable t)))))
            ;; extend the ruler with T up to END, or the full length
            ,(when subseqp `(setq ,ruler (adjust-array ,ruler
                                            (list ,(or end
                                                     `(length ,sequence)))
                                            :initial-element t)))
            (map nil #'(lambda (,element ,@(when subseqp
                                             `(,mark)))
                          ,(if subseqp
                             `(when ,mark (progn ,@body))
                             `(progn ,@body)))
                     ,sequence
                     ,@(when subseqp
                          `(,ruler)))))))

I haven't tested or timed this, so your risks are your own.
(I don't think it can handle declarations without more work. Try LET instead
of PROGN.)
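
An untested usage sketch, assuming the macro behaves as intended:

(dosequence (ch "hi dee ho" :start 3 :end 6)
  (princ ch))
;; should print: dee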

> > > The other is the ability to provide access to the sequence being
> > > iterated from the current position.

How is this different from an indirect array?

> > > > If I understand you correctly, I'm not sure I agree that this is
> > necessary for an iteration facility to be general.  What I think you
> > mean is the ability to do something like:
> ....

Regards,
Nate

P.S. Just a brief note. This 'argument' seemed to really reach fever pitch
     when you elevated this conceptual idea you had hatched that "random-access"
     - or more specifically ARRAYS, were really something different than what
     people have practically and blithely regarded them for many decades. In a
     way, this is a creative, or at least iconoclastic, approach to thinking.
     But I think you have fallen into a trap that if your idea is not explicitly
     excluded by any authoritative document it is a valid idea. It is far more
     important that computer systems serve human needs and meet human expectat-
     ions than this business of 'cleanliness' and 'purity'. It's a little like
     trying to argue against somebody's social or artistic criticism by quoting
     a dictionary at them.
From: William D Clinger
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <b84e9a9f.0204031844.edad4d5@posting.google.com>
Warning: geek humor, but a true story.

Nate Holloway made the very good point that
> There is a big difference between a uniform-time process and an O(1) process.
> Consider the computation of some function f(N) which takes (mod N x) seconds,
> where x << N. This function does not have O(N), nor is it "uniform-time".
> Variation in latency due to factors uncorrelated to the magnitude of the argu-
> ment are not what Order of complexity is designed to measure.
> Additionally, a cache is not like a hash table - there is no recourse to a
> sequential search. If a cache misses the access is punted to a deeper level
> of machinery with a larger, yet equally constant-bounded latency. As the storage
> hierarchy is finite the time complexity of access is always O(1).

I agree with everything except the last sentence.

Once upon a time, I was writing a long program, and made the rookie
mistake of not saving my work.  Hours and hours of work.  Lost in
concentration, I didn't notice the coming thunderstorm.

FLASH/KABOOM!!!!!  The delay between the lightning and the thunder
is about 5 seconds per mile, so this lightning bolt had struck maybe
100 feet away.  The entire neighborhood went dark.

Except for my Apple PowerBook 170 laptop, which was still running on
battery power.  Unfortunately, I was using an external hard disk as
my paging device, and it had lost power.  I could still edit my file,
but how could I save it without involving the SCSI bus?

Brilliant idea!  I'll save my file to a floppy disk!  I inserted a
floppy, which showed up on my desktop.  It's going to work!  I pulled
down the "File" menu to save my file...and my PowerBook locked up.
Because I hadn't saved my file for a long time, the File menu was in
a part of virtual memory that had been tossed out of RAM, so I hit a
page fault, so I was hosed.

Despair.  My PowerBook is processing a page fault, and the paging disk
doesn't have power.  What to do?  The only thing I could think to do
was to wait for the power to come back on, and to hope that the page
fault processing would complete when the disk came back on line.

Two hours after the thunderbolt, my lights came back on, the disk spun
back up, and the page fault completed.  I was able to save my work.

So I'm not so sure that the virtual memory hierarchy is really O(1).

Will
From: Rahul Jain
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <873cyccd81.fsf@photino.sid.rice.edu>
······@qnci.net (William D Clinger) writes:

> Two hours after the thunderbolt, my lights came back on, the disk spun
> back up, and the page fault completed.  I was able to save my work.
> 
> So I'm not so sure that the virtual memory hierarchy is really O(1).

Cool story! :)

But... all that means is that the constant factor was temporarily
increased due to a change in the system.

-- 
-> -/                        - Rahul Jain -                        \- <-
-> -\  http://linux.rice.edu/~rahul -=-  ············@techie.com   /- <-
-> -/ "Structure is nothing if it is all you got. Skeletons spook  \- <-
-> -\  people if [they] try to walk around on their own. I really  /- <-
-> -/  wonder why XML does not." -- Erik Naggum, comp.lang.lisp    \- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
   (c)1996-2002, All rights reserved. Disclaimer available upon request.
From: William D Clinger
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <b84e9a9f.0204050815.17beb51d@posting.google.com>
Rahul Jain wrote:
> But... all that means is that the constant factor was temporarily
> increased due to a change in the system.

I think Rahul has just invented a sorting algorithm that runs in
O(1) time.
:)

Will
From: Rahul Jain
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <87zo0hhl4d.fsf@photino.sid.rice.edu>
······@qnci.net (William D Clinger) writes:

> Rahul Jain wrote:
> > But... all that means is that the constant factor was temporarily
> > increased due to a change in the system.
> 
> I think Rahul has just invented a sorting algorithm that runs in
> O(1) time.
> :)

Yeah! it's a secret! But I'm selling it for the low, low price of $1
million! But wait, that's not all! If you buy now, you'll get a FREE
proof that P = NP! Don't delay! Call now! Operators are standing by!

But seriously, just because your network goes down doesn't mean that
your distributed algorithm is now O(n!) instead of O(n). It just means
that you add a constant, the duration of the failure, to the total
running time.

-- 
-> -/                        - Rahul Jain -                        \- <-
-> -\  http://linux.rice.edu/~rahul -=-  ············@techie.com   /- <-
-> -/ "Structure is nothing if it is all you got. Skeletons spook  \- <-
-> -\  people if [they] try to walk around on their own. I really  /- <-
-> -/  wonder why XML does not." -- Erik Naggum, comp.lang.lisp    \- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
   (c)1996-2002, All rights reserved. Disclaimer available upon request.
From: Kent M Pitman
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <sfwpu1dwy90.fsf@shell01.TheWorld.com>
Rahul Jain <·····@sid-1129.sid.rice.edu> writes:

> ······@qnci.net (William D Clinger) writes:
> 
> > Rahul Jain wrote:
> > > But... all that means is that the constant factor was temporarily
> > > increased due to a change in the system.
> > 
> > I think Rahul has just invented a sorting algorithm that runs in
> > O(1) time.
> > :)
> 
> Yeah! it's a secret!

Watch out.  Those O(1)'s can be elusive.

I once obsessed for a while about how you could use tricks like those
Church numerals use to represent most data structures as patterns of
closures, but the O(1) lookup time of arrays was hard to simulate that
way, since just unwrapping the closures necessarily took time, and
takes different time depending on how many wrapper closures you have.

It was a long time, and took some help from others, before I
acknowledged the obvious truth: that even the O(1) times for
number-related things are mostly tricks, and that really even indexed
lookup is very likely playing some sort of game with logarithmic-time
lookups even in hardware; it's just that these are done so much faster
than the others by the hardware that you don't see the differences in
effect, plus you can bias the O(log(n)) so that, even though it takes
time proportional to the number of bits, since all numbers have the
same number of bits you _really_ don't see it.  By bounding the upper
end of fixnum operations and assuring that people pass a full set of
binary digits, it looks a lot more uniform.  But that's only because
the full generality of integers is lost until you move to bignums,
where the effect is more strongly felt.

Anyway, if you really understand integer-indexed things to not really
be O(1) then you can make a cascade of lambdas that really can "keep
up" speedwise with the illusion of constant access, because the whole
"constant time" thing was an illusion...

At least, I think I said that right. I'm being a little vague because I
didn't get a full night's sleep last night.  Time for me to nap awhile
and let some helpful soul correct me if I've misstated the issue.
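
A hypothetical rendering of the "patterns of closures" idea (not
Kent's actual construction, just an illustration): each element access
has to unwrap one closure per index, so the lookup that arrays make
look constant-time is visibly linear here.

(defun closure-vector (&rest elements)
  "Encode ELEMENTS as nested closures; no bounds checking."
  (if (null elements)
      nil
      (let ((head (first elements))
            (tail (apply #'closure-vector (rest elements))))
        (lambda (i)
          (if (zerop i)
              head
              (funcall tail (1- i)))))))

;; (funcall (closure-vector 'a 'b 'c) 2) => C, after unwrapping three closures.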
From: Rahul Jain
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <87pu1dgy48.fsf@photino.sid.rice.edu>
Kent M Pitman <······@world.std.com> writes:

> Watch out.  Those O(1)'s can be elusive.

> It was a long time, and took some help from others, before I
> acknowledged the obvious truth, that even the O(1) times for
> number-related things are mostly tricks, and that really even indexed
> lookup is very likely doing some sort of game with logarithmic time
> lookups even in hardware [...]

Yeah, complexity calculations are, of course, usually done very
abstractly. Specifically, we agree that some set of calculations are
O(1), e.g. integer arithmetic operations. Of course, that assumes we
are using hardware-implemented fixed-word operations, etc...

-- 
-> -/                        - Rahul Jain -                        \- <-
-> -\  http://linux.rice.edu/~rahul -=-  ············@techie.com   /- <-
-> -/ "Structure is nothing if it is all you got. Skeletons spook  \- <-
-> -\  people if [they] try to walk around on their own. I really  /- <-
-> -/  wonder why XML does not." -- Erik Naggum, comp.lang.lisp    \- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
   (c)1996-2002, All rights reserved. Disclaimer available upon request.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0204032316.462b7786@posting.google.com>
····@crackaddict.com (Nate Holloway) wrote in message news:<···························@posting.google.com>...
> ·····@designix.com.au (Brian Spilsbury) wrote:
> <In a previous article:>
> > If we define random-access to be uniform time access, then the
> > addition of a cache mechanism to an otherwise random-access structure
> > causes it to stop being random-access (or at least become less
> > random-access).
> >
> > Beyond this, it begs the question 'why is random-access called
> > random-access rather than uniform-time access?'
> 
> There is a big difference between a uniform-time process and an O(1) process.
> Consider the computation of some function f(N) which takes (mod N x) seconds,
> where x << N. This function does not have O(N), nor is it "uniform-time".

Yes, this is a good point.

[...]

> You seem to want to view data structures as
> patterns of bits without evidence of intent. I would suggest that a data struc-
> ture is an artifact that embodies an intent and as such can be defensibly said
> to have an efficiency or a complexity for its intended purpose.

Well, this is a rather odd assertion.

I don't see how anything that I've said could be interpreted as
viewing ADTs as patterns of bits.

An ADT can certainly be said to have an efficiency and a complexity
for a particular purpose, insofar as it provides the constraints
from which you can derive such efficiencies.

That is, if nth is specified to be in terms of car/cdr, you can quite
happily show that nth is O(n); if not, then your options may be more
open.

[...]

> By the way, I don't think this story is particularly realistic. The latencies
> associated with disk access have very little correlation with the magnitude of
> the address, since the platter surface doesn't start in the same position
> prior to each access.

This again comes back to ADT vs. implementation.

A hard-drive may well be a sequential storage device, but the
interface that we use to a hard-drive is a random access one.

You can consider a hard-drive with a random-block accessor to be a
random access device implemented with a substrate which does not have
uniform access times.

If we care to access the implementation here with an interface which
exploits the implementation's underlying non-random access
characteristics, then there might be some advantage in that. (Not
having a simple correlation with the magnitude of the address doesn't
really matter, as long as we can have more/less optimal performance
from ordering our accesses in various ways.)

> > Which is not to say that it might not have a cost up to O(n) to access
> > a particular location for a particular implementation.
> 
> You have selectively quoted FOLDOC to manufacture the impression that it agrees
> with your tendentious argument. In fact, the same definition later claims:
>     "Interestingly, some {DRAM} devices are not truly random access because var-
>     ious kinds of "{page mode}" or "column mode" mean that sequential access is
>     faster than random access."
> So what FOLDOC actually says is that any small-scale deviation from uniform lat-
> ency makes it "not truly random access". You are saying that an operation is
> random-access whenever its interface addresses its parts using absolute
> indices, regardless of latency.

That last sentence is obviously phrased poorly; what it seems to mean
is that DRAM can be accessed both as a random-access device and a
sequential-access device (i.e. via different operations), and the
sequential-access operations give better performance under some
conditions.

The self-contradictory element is in the 'not truly random access'
followed by the 'random access' bit: does adding the ability to
sequentially access something magically degrade its ability to be
randomly accessed?

> Let us look at the different perspectives here to see how they differ.
> 
> * This definition from FOLDOC says that "random-access" applies
>   only to structures where each access completes in the same amount of time.

Actually it says that the time to access is not affected by prior
accesses.

The main implication of this is that given a random-access interface
there are no optimal/suboptimal access orderings possible.

> * You assert that "random-access" has essentially no meaning when applied to a
>   data structure, only when applied to the interface of some operation, and it
>   says nothing about the order of complexity of the operation.

Well, my assertion is that a data-structure is fundamentally an
identity upon which various operations are applicable.

I think that random-access is quite meaningful when applied to
'accesses to an array object', since 'accesses to an array object' is
an operation.

> There is a midpoint between these views (very restrictive, on one hand, and
> meaningless, on the other).
> 
> * A random-access process must complete in O(1) time. It can take any amount
>   of time for each access as long as the upper bound on these times is not
>   affected by the magnitude of the address. A random-access structure is one
>   which allows this process.

Given that you've defined your O in terms of the access operator, I'll
agree.

> > I would have to say that NTH is a random access operator, with O(n)
> > access characteristics when implemented over car/cdr.
> >
> > [as a counterpoint, consider NTH implemented for a cdr-coded list :),
> > it remains random-access, but with O(1) access characteristics]
> 
> Certainly not. Even if we assume that each Q of the list has cdr-next except for
> the last Q having cdr-nil, NTH is required to cdr down sequentially, since it
> has no other way of telling when it has reached the end. Cdr-coding of lists is
> also not either/or. The list on which NTH is called may have been produced by
> LIST*ing a cdr-coded list onto a normally-coded list, and the only way for NTH
> to be correct in the face of this possibility is to examine each Q.

Well, that's a good point, although a compiler may well know how long
a particular cdr-coded list must be at a given point, and may be able
to leverage this. The use of a known cdr-coded list literal would be a
simple example of where this would be possible.

A better example would have been Python-style lists, which are really
vectors, and have O(1) access.

The main point is that we could certainly implement lists such that
nth has O(1) access characteristics without violating the list ADT,
and vice versa, although it might be rather pointless to do so.

I think what this means is that unless the access characteristics of
an ADT operation are constrained in the ADT, we must not attribute
access characteristics to that ADT. Unless the list ADT specifies that
NTH must be implemented in terms of CDR/CAR then I do not think that
we can justify asserting that NTH must be O(n) for all list
implementations.

> > > > The other is the ability to provide access to the sequence being
> > > > iterated from the current position.
> 
> How is this different from an indirect array?

I'll assume that you mean displaced array here.

In which case, it differs only in being a displaced sequence, rather
than a displaced array.

I agree that displaced-sequence would be a much better term than
point.
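
For reference, the array-level facility already exists in CL; a
displaced sequence would generalize something like this sketch:

;; A displaced array sharing part of a string, rather than copying it.
(let* ((base (copy-seq "hi dee ho"))
       (word (make-array 3
                         :element-type (array-element-type base)
                         :displaced-to base
                         :displaced-index-offset 3)))
  word)
;; => "dee", sharing storage with BASE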

> > > > > If I understand you correctly, I'm not sure I agree that this is
> > > necessary for an iteration facility to be general.  What I think you
> > > mean is the ability to do something like:
> > ....
> 
> Regards,
> Nate
> 
> P.S. Just a brief note. This 'argument' seemed to really reach fever pitch
>      when you elevated this conceptual idea you had hatched that "random-access"
>      - or more specifically ARRAYS, were really something different than what
>      people have practically and blithely regarded them for many decades. In a
>      way, this is a creative, or at least iconoclastic, approach to thinking.
>      But I think you have fallen into a trap that if your idea is not explicitly
>      excluded by any authoritative document it is a valid idea. It is far more
>      important that computer systems serve human needs and meet human expectat-
>      ions than this business of 'cleanliness' and 'purity'. It's a little like
>      trying to argue against somebody's social or artistic criticism by quoting
>      a dictionary at them.

If you read back, you'll see that Erik took exception to a
parenthesized side-note that CL did not specifically require arrays to
be O(1) access.

This was never a significant point in what I was talking about.

My main complaint was that CL had defined strings in terms of vector
rather than sequence, for which I can find no good justification,
although I'm sure it exists.

If you know of a sensible reason for this decision, please let me
know.

Regards,

Brian
From: Kent M Pitman
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <sfw1ydziyyw.fsf@shell01.TheWorld.com>
·····@designix.com.au (Brian Spilsbury) writes:

> A sequence contains things which have both
> vector-access-characteristics and list-access-characteristics.

No.

A sequence contains things which have EITHER
vector-access-characteristics OR list-access-characteristics.

The SET of all sequences admits BOTH things that have 
vector-access-characteristics AND things that have 
list-access-characteristics.

These are different statements.
From: Brian Spilsbury
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <f0f9d928.0204012127.6675321d@posting.google.com>
Kent M Pitman <······@world.std.com> wrote in message news:<···············@shell01.TheWorld.com>...
> ·····@designix.com.au (Brian Spilsbury) writes:
> 
> > A sequence contains things which have both
> > vector-access-characteristics and list-access-characteristics.
> 
> No.
> 
> A sequence contains things which have EITHER
> vector-access-characteristics OR list-access-characteristics.
> 
> The SET of all sequences admits BOTH things that have 
> vector-access-characteristics AND things that have 
> list-access-characteristics.
> 
> These are different statements.

Sequence is not just defined by the type-restriction.

"Sequences are ordered collections of objects, called the elements of
the sequence.

The types vector and the type list are disjoint subtypes of type
sequence, but are not necessarily an exhaustive partition of
sequence."

The not necessarily exhaustive clause is important imho, since it
allows for things which are sequences, but neither vector nor list.

Given this we need to look at the operations defined upon objects of
type sequence.

elt, length, subseq, copy-seq, fill, replace, count, position, ...

Some of these operations use independent element access, some of
these use interdependent element access, i.e. elt vs. position.

The non-exhaustive partition note indicates that you cannot reduce
sequence to list XOR vector, and must consider sequence to be an ADT
of its own, with two common implementations.

I do agree that my statement above was problematic, thank you for
pointing this out.

Regards,

Brian
From: Kent M Pitman
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <sfw663aps3v.fsf@shell01.TheWorld.com>
·····@designix.com.au (Brian Spilsbury) writes:

> The types vector and the type list are disjoint subtypes of type
> sequence, but are not necessarily an exhaustive partition of
> sequence."

Yes, for better or worse, this is left to _vendor_ experimentation.

A vendor, of course, can pass through experimentation capability to you.

As the NBS rep pointed out early on in the standards process, it's not the
role of a standards committee to do design.  We did it sometimes, but always
as a last resort in order to achieve consensus when the options were in 
conflict.  The first choice, though, is to have one or more vendors with a
happy experience to report.

So I'd work on convincing my vendor if I were you...
From: Tim Moore
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <a8bi4t$k9u$0@216.39.145.192>
On Tue, 2 Apr 2002 05:49:40 GMT, Kent M Pitman <······@world.std.com> wrote:
>·····@designix.com.au (Brian Spilsbury) writes:
>
>> The types vector and the type list are disjoint subtypes of type
>> sequence, but are not necessarily an exhaustive partition of
>> sequence."
>
>Yes, for better or worse, this is left to _vendor_ experimentation.
>
>A vendor, of course, can pass through experimentation capability to you.
>
>As the NBS rep pointed out early on in the standards process, it's not the
>role of a standards committee to do design.  We did it sometimes, but always
>as a last resort in order to achieve consensus when the options were in 
>conflict.  The first choice, though, is to have one or more vendors with a
>happy experience to report..
>
>So I'd work on convincing my vendor if I were you...

Brian *is* assuming the role of vendor here, i.e., SBCL hacker.

Tim
From: Stephan H.M.J. Houben
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <a8412m$9k1$1@news.tue.nl>
In article <·············@globalgraphics.com>, Pekka P. Pirinen wrote:
>> Basically then we would have strings which are UCS-4, UCS-2 and
>> Latin-1 restricted (internally, not visibly to users). [...]
>> Procedures like string-set! therefore might have to inflate (and
>> thus copy) the entire string if a value outside the range is stored.
>> But that's ok with me; I don't think it's a serious lose.
>
>I suppose that is a viable implementation strategy, but I don't think
>it's the right option.  The language should expose the range of string
>data types to the programmer, and let them choose, because the range
>of memory usage is just too great to sweep under the mat.  Also,
>having strings automatically reallocated means an extra indirection
>for access which cannot always be optimized away.

If you have more than one string type anyway, then you can have
both directly and indirectly represented strings. It is then
possible to arrange that any directly represented string can
be replaced with an indirectly represented string. Then,
arrange for the garbage collector to remove all indirections.

Again, this is not that much more complex once you have decided to
go for multiple string types anyway. Moreover, it is
completely transparent to the programmer and it can provide
other useful features, e.g. growing of strings. Indeed, it is
even possible for the implementation to dynamically decide to
overallocate storage once a string has been grown, so that
naively building a string character-by-character will be
O(n).
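
Roughly what that buys you can be seen with facilities CL already
exposes (a sketch; the exact growth policy of VECTOR-PUSH-EXTEND is
implementation-dependent, but most implementations extend storage by
more than one element at a time):

(defun build-string-naively (chars)
  "Push characters one at a time onto an adjustable string."
  (let ((s (make-array 0 :element-type 'character
                         :adjustable t :fill-pointer 0)))
    (dolist (c chars s)
      (vector-push-extend c s))))

;; (build-string-naively '(#\f #\o #\o)) => "foo", in amortized O(n) overall.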

All this adds implementation complexity, but it makes string handling
much easier on the programmer.

To go even further: one could provide lazy string copying with
copy-on-write, optimised string concatenation in which 
substrings are shared, and since the OP wants to replace files
by strings, he could even consider having the GC dynamically
compress and uncompress large strings.

OK, this is really overengineered, but anyway...

Greetings,

Stephan
>
>I note that offering multiple string types is exactly what all the CL
>implementations seem to have done.  This doesn't preclude having
>features that automatically select the smallest feasible type, e.g.,
>for "" read syntax or a STRING-APPEND function.
>-- 
>Pekka P. Pirinen
>The gap between theory and practice is bigger in practice than in theory.
From: Thomas Bushnell, BSG
Subject: Re: Back to character set implementation thinking
Date: 
Message-ID: <87sn6h2rn1.fsf@becket.becket.net>
········@wsan03.win.tue.nl (Stephan H.M.J. Houben) writes:

> To go even further: one could provide lazy string copying with
> copy-on-write, optimised string concatenation in which 
> substrings are shared, and since the OP wants to replace files
> by strings, he could even consider to have the GC dynamically
> compress and uncompress large strings.

I don't know about compressing (though it's not a bogus idea).  Doing
lazy sharing by copy-on-write is certainly a good approach for large
strings, and that will probably be a necessary feature of the system
to make various user-interface tweaks work right.  Thanks for the
idea.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <fohen4eej1.fsf@trex10.cs.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

> * Matthias Blume <········@shimizu-blume.com>
> | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
> 
>   Yeah, me too.

I was under the impression that you thought you already did. :-)

>   Then I could force you to pay attention to the premises
>   that start a discussion instead of completely ignoring the context.
>   Please see <················@naggum.net>, and pay particular attention to
>   what Thomas Bushnell wrote.

To be frank, I do not care *one bit* about what this discussion was
originally about.  I was merely commenting on your claim about
capitalization being "incidental".  The debate of whether or not
case-sensitive identifiers in programming languages are Good or Evil,
or which character set designs use up more bits than others, etc., bores
me.

Matthias
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226091723135094@naggum.net>
* Matthias Blume
| I was under the impression that you thought you already did. :-)

  Wipe that moronic grin off your face, dimwit.  What your retarded
  impression of other people might be should not concern anybody else.
  Such despicably stupid behavior should have been punished by people who
  cared about you.  Why have they not?

| To be frank, I do not care *one bit* about what this discussion was
| originally about.

  Of course not.  Moronic grins are a pretty strong indicator of impaired
  mental capacity, starting with the sheer inability to take other people
  seriously.

| I was merely commenting on your claim about capitalization being
| "incidental".  The debate of whether or not case-sensitive identifiers in
| programming languages are Good or Evil, or which character set design use
| up more bits than others, etc., bore me.

  I tried to suggest _strongly_ that you should go back to daytime TV, but
  did you get it?  No.  How amazingly dense you must be.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Kent M Pitman
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <sfwbsdcil40.fsf@shell01.TheWorld.com>
Matthias Blume <········@shimizu-blume.com> writes:

> Erik Naggum <····@naggum.net> writes:
> 
> > * Matthias Blume <········@shimizu-blume.com>
> > | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
> > 
> >   Yeah, me too.
> 
> I was under the impression that you thought you already did. :-)
> 
> >   Then I could force you to pay attention to the premises
> >   that start a discussion instead of completely ignoring the context.
> >   Please see <················@naggum.net>, and pay particular attention to
> >   what Thomas Bushnell wrote.
> 
> To be frank, I do not care *one bit* about what this discussion was
> originally about.  I was merely commenting on your claim about
> capitalization being "incidental".  The debate of whether or not
> case-sensitive identifiers in programming languages are Good or Evil,
> or which character set design use up more bits than others, etc., bore
> me.

Capitalization _is_ incidental.  It is ceremonially marked in written
text, but my impression based on a basic knowledge of linguistics and
a casual outside view of German [I don't purport to speak the
language] is that German people may claim that "weg" and "Weg" are
different words, but the capitalization is not pronounced audibly, so
there is generally enough contextual information to disambiguate in
speech.  Certainly this is the case for English situations like "God
loves you." and "The god loves you."  These are different words, God.
One is a proper name and one isn't.  But if it were miscapitalized
"god loves you" or "The God loves you".  It is possible for there to
be ambiguity in spite of this in some cases, but it's also possible to
have ambiguity in the case of correct case, too.  Human language is
not precise.  But normally where a confusion is common, some audible
notation arises to disambiguate.  And, incidentally, the audible
notation is [to my knowledge] never the addition of the word
"uppercase" or "lowercase" because that just isn't the issue in play.
It's usually the addition of a guide word, a case marking, a
determiner, etc.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <fod6xsebeo.fsf@trex10.cs.bell-labs.com>
Kent M Pitman <······@world.std.com> writes:

> [ ... ] outside view of German [I don't purport to speak the
> language] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly,

The two words are pronounced very differently.

> so there is generally enough contextual information to disambiguate in
> speech.

Ok, so everything that can be inferred from context is "incidental"
then?  Most spelling mistakes can be inferred from context, so should
we make programming languages tolerate them?  (It has been tried, as
you know.)

Anyway, this whole debate is supremely silly, IMHO.  Fortunately
neither you nor Erik get to dictate the rules, at least not for those
languages that I speak or program in...

Matthias
From: Kent M Pitman
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <sfwy9ggjvhh.fsf@shell01.TheWorld.com>
Matthias Blume <········@shimizu-blume.com> writes:

> Kent M Pitman <······@world.std.com> writes:
> 
> > [ ... ] outside view of German [I don't purport to speak the
> > language] is that German people may claim that "weg" and "Weg" are
> > different words, but the capitalization is not pronounced audibly,
> 
> The two words are pronounced very differently.
> 
> > so there is generally enough contextual information to disambiguate in
> > speech.
> 
> Ok, so everything that can be inferred from context is "incidental"
> then?  Most spelling mistakes can be inferred from context, so should
> we make programming languages tolerate them?  (It has been tried, as
> you know.)

Please read Aristotle on Virtue Ethics.  The mean between unreasonable 
extremes is not something with a fixed answer.  The fact that its precise
point in design space is not uniquely determined does not mean it should 
not be something people strive for.  If anyone seriously wants to defend
spelling errors as a good design theory, we could have a discussion about
it.  Otherwise, it's a pointless red herring.  I do, however, contend
that there is a theory behind the point of view CL has, and I was
merely describing that point of view.
 
> Anyway, this whole debate is supremely silly, IMHO.  Fortunately
> neither you nor Erik get to dictate the rules, at least not for those
> languages that I speak or program in...

We aren't dictating rules, and I personally don't really appreciate this
attempt to recast my defense of an arbitrary but reasonable design choice
into some sort of ignorant attempt to control the world.

All we have done is to try to explain the present state of affairs based
on an attempt for harmony with something people do with a great deal of 
statistical regularity.  Probably there is no deed that everyone does with
any predictability other than, as they say, death and taxes, but it seems
inappropriate to base design on the idea that this implies no other 
large scale regularities worth checking into...
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87n0wwa0ug.fsf@becket.becket.net>
Kent M Pitman <······@world.std.com> writes:

> Please read Aristotle on Virtue Ethics.  The mean between unreasonable 
> extremes is not something with a fixed answer.  

It can also only be determined by the man with a particular virtue
known as "practical wisdom".  And, with practical wisdom,
comes all the virtues, not just one or two.  Which means that only the
person with true virtue is even able to tell what the Right Thing to
do is.

Aristotle's talk of a "mean" is a metaphor, of course.  It's some kind
of balance, some kind of "just enough" notion.

Some medievals liked to pooh-pooh this by taking it over-literally, with
a rather snide attack.  Thomas Aquinas, however, liked the "mean"
theory, and here's how he treats of the snide attackers (from the
"Quastio disputata de virtutibus in communi", Article 13, Objection 7
and the response):

  Whether virtue lies in a mean.  It seems not....Boethius in "On
  arithmetic" speaks of a threefold mean, the arithmetical, as 6
  between 4 and 8 which is an equal distance from both, and the
  geometrical, as 6 between 9 and 4, which is proportionally the same
  distance from both, and the harmonic or musical mean, as 3 between 6
  and 2 because there is the same proportion of one extreme to the
  other, namely, 3 (which is the difference between 3 and 6) to 1 (which
  is the difference between 2 and 3).  But none of these means is found
  in virtue, since the mean of virtue does not relate equally to
  extremes, nor in a quantitative way nor according to some proportion
  of the extremes and differences.  Therefore, virtue does not lie in
  the mean.

  [replies Thomas]: It should be said that the means spoken of by
  Boethius lie in things and thus are not relevant to the mean of
  virtue which is determined by reason.  Justice seems to be an
  exception since it involves both a mean in things and another
  according to reason: The arithmetical mean is relevant to exchange
  and the geometrical to distribution, as is clear from [Aristotle's
  Nicomachean] Ethics [book] 5.

Anyway, I'd recommend the Nicomachean Ethics of Aristotle to anyone
interested in thinking.  You'll find it aggravating; he's quite
unmodern and actually quite bogus in a lot of ways, but he is truly
important and it will change a great deal about how you think, if you
take it seriously. 

Thomas
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <fo8z8ge7gy.fsf@trex10.cs.bell-labs.com>
Kent M Pitman <······@world.std.com> writes:

> We aren't dictating rules, and I personally don't really appreciate this
> attempt to recast my defense of an arbitrary but reasonable design choice
> into some sort of attempt at an ignorant attempt to control the world.

Sorry, I was unreasonably harsh on you, Kent.

> All we have done is to try to explain the present state of affairs based
> on an attempt for harmony with something people do with a great deal of 
> statistical regularity.

As I have tried to point out, this sort of regularity isn't actually
quite as regular as some make it out to be.  The Japanese language is a
great example (although there the distinction is not called "uppercase vs.
lowercase").

By the way, here is an example in a case-sensitive natural language
where the distinction between uppercase and lowercase gets
*pronounced*: "mit" vs. "MIT" in German.  The first means "with" and is
pronounced like "mitt", the second is the Massachussetts Institute of
Technology and is pronounced like speakers of English would pronounce
it: em-ay-tee.  I think that there are enough examples of this around
so that making a distinction between uppercase and lowercase is
warranted in the natural language case.  Again, I do not think that
this needs to be in any way correlated with the PL case.

Matthias
From: Pierre R. Mai
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87663kpbjd.fsf@orion.bln.pmsf.de>
Matthias Blume <········@shimizu-blume.com> writes:

> By the way, here is an example in a case-sensitive natural language
> where the distinction between uppercase and lowercase gets
> *pronounced*: "mit" vs. "MIT" in German.  The first means "with" and is
> pronounced like "mitt", the second is the Massachussetts Institute of
> Technology and is pronounced like speakers of English would pronounce
> it: em-ay-tee.  I think that there are enough examples of this around

This is "supremely silly", if there is such a thing, even ignoring for
the time that MIT is neither a german word, nor a german abbreviation,
and that probably a large number of german speakers will not recognize
MIT as standing for "the" MIT, nor pronounce it as speakers of English
would.  The different pronounciation of mit vs. MIT doesn't result
from the difference in case, at all.  If you receive a telex that
informs you of an invitation to "the mit", you will pronounce "mit"
just as you would "MIT".  qed.

Of course that doesn't mean that case should be completely ignored; it
just means that case is just another attribute of text, like fonts,
and that there is little reason to encode it in the character.

It also means that you want to distinguish between mit (with) and MIT
(the institute) not based on case, but based on packages, i.e.

(and (not (eq 'german-words:mit 'universities:mit))
     ;; And now an example where case will not help in disambiguation
     ;; namely the sequence "tub", standing for both the english word
     ;; tub and the common abbreviation for the Technische Universität
     ;; Berlin
     (not (eq 'english-words:tub 'universities:tub)))

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7petj$n2u71$1@ID-125440.news.dfncis.de>
In article <··············@orion.bln.pmsf.de>, Pierre R. Mai wrote:
> Matthias Blume <········@shimizu-blume.com> writes:
> 
>> By the way, here is an example in a case-sensitive natural language
>> where the distinction between uppercase and lowercase gets
>> *pronounced*: "mit" vs. "MIT" in German.  The first means "with" and is
>> pronounced like "mitt", the second is the Massachussetts Institute of
>> Technology and is pronounced like speakers of English would pronounce
>> it: em-ay-tee.  I think that there are enough examples of this around
> 
> This is "supremely silly", if there is such a thing, even ignoring for
> the time that MIT is neither a german word, nor a german abbreviation,
> and that probably a large number of german speakers will not recognize
> MIT as standing for "the" MIT, nor pronounce it as speakers of English
> would.  The different pronunciation of mit vs. MIT doesn't result
> from the difference in case, at all.  If you receive a telex that
> informs you of an invitation to "the mit", you will pronounce "mit"
> just as you would "MIT".  qed.

I agree that the MIT example is silly, but there are much better ones.
Compare

``Der Philosoph fuehlt sich im allgemeinen wohl.''
[roughly: "In general, the philosopher feels at ease."]

with

``Der Philosoph fuehlt sich im Allgemeinen wohl.''
[roughly: "The philosopher feels at ease in the realm of the general."]

In speech, you can tell the difference because in the latter case
the main accent is on ``Allgemeinen'', whereas in the former it
is on ``wohl''.  Incidentally, the totally moronic ``spelling
reform'' that happened a few years ago breaks this example,
like numerous others, but fortunately at least my favorite
newspaper continues to use the old spelling.

Regards,
-- 
Nils Goesche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x42B32FC9
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7qssh$1ddr$1@news.cybercity.dk>
Nils Goesche <······@cartan.de> skrev:

> Compare ``Der Philosoph fuehlt sich im allgemeinen wohl.'' with
> ``Der Philosoph fuehlt sich im Allgemeinen wohl.'' In speech,
> you can tell the difference because in the latter case the
> main accent is on ``Allgemeinen'', whereas in the former it
> is on ``wohl''. Incidentally, the totally moronic ``spelling
> reform'' that happened a few years ago breaks this example,
> like numerous others, but fortunately at least my favorite
> newspaper continues to use the old spelling.

Written Danish used to have capitalization rules similar to
German's, but that was changed in a spelling reform in 1948. It
took quite some time, about twenty years in fact, before the
last newspaper had started using all the things introduced in
the reform (capitalization wasn't the only change; a new letter
was added to the alphabet as well). There was no shortage of
arguments similar to the one you presented above. Even claims to
the effect that the language had been ruined. You don't hear them
much anymore, if at all. Most people quickly found out that the
arguments put forth in defense of the old system were strawmen.
The capitalization of nouns really did turn out to be just an
incidental accident of history that served no real purpose beyond
the purely aesthetic.

-- 
Torsten
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <874rj30wzk.fsf@darkstar.cartan>
Torsten <·····@fraqz.archeron.dk> writes:

> The capitalization of nouns really did turn out to be just an
> incidental accident of history that served no real purpose beyond
> the purely aesthetic.

I don't know if it is of any use in Danish; I don't speak Danish.
But I have /read/ texts that didn't use capitalization in German,
and it was very annoying in that it just makes it harder to guess
how a sentence is likely to end, something that is very important
in German (the verb at the end...).  You don't grasp the
structure of a sentence as easily.  Sure, that doesn't mean
capitalization is /necessary/, except for cases like the one I
posted before; but if it simply makes reading a bit easier, I
don't want to miss it.  This has been measured, BTW.

Regards,
-- 
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7sla6$tj4$1@news.cybercity.dk>
Nils Goesche <···@cartan.de> skrev:

> But I have /read/ texts that didn't use capitalization in
> German, and it was very annoying in that it just makes harder
> to guess how a sentence is likely to end, something that is
> very important in German (the verb at the end...). [...] This
> has been measured, BTW.

I hope you can see the obvious flaw in such measurements. There
is no large German speaking group not trained to capitalize
nouns.

-- 
Torsten
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87663gjx0u.fsf@darkstar.cartan>
Torsten <·····@fraqz.archeron.dk> writes:

> Nils Goesche <···@cartan.de> skrev:
> 
> > But I have /read/ texts that didn't use capitalization in
> > German, and it was very annoying in that it just makes harder
> > to guess how a sentence is likely to end, something that is
> > very important in German (the verb at the end...). [...] This
> > has been measured, BTW.
> 
> I hope you can see the obvious flaw in such measurements. There
> is no large German speaking group not trained to capitalize
> nouns.

Well, duh.  Giving up the extra effort of capitalization isn't
exactly something you have to train anybody for.  I have read
several *books* in German that didn't use capitalization at all.
It was horrible.  I don't have much time in the morning, but
still manage to read large parts of the Frankfurter Allgemeine
every morning, in very little time.  I sometimes ``observe''
myself how I read that fast, and I found out that when looking at
a whole block of text at a time, the visual structure of
sentences indicated by capitalized words is a very useful help
for the reader.

That's the whole point of capitalization.  It makes reading
easier, in German, anyway.  The /only/ argument I /ever/ heard
against capitalization was that it supposedly makes /writing/
easier for some retarded children.  Well, even if that were true,
which it isn't, who would want to read anything written by
retarded children, anyway?  Why do you write something down in
the first place, if it wasn't for somebody to read?  It's the
reader that counts, not the writer.

I don't know Danish.  Maybe it doesn't make a difference there.
Maybe Danes only used to capitalize nouns because the Germans
did, and as the Germans weren't exactly very popular in 1948,
that might have been a good opportunity to give up on it.

Or maybe it wasn't.  Who is supposed to know anymore?  You said
there was a controversy about it; maybe the people who were
against it were right?  Who would remember?  How could you tell?
When you lose a piece of culture like that, later generations
don't remember or miss any of it anymore, but that doesn't mean
it was right to drop it.

Suppose all the governments in the world suddenly decide to put
an end to this babylonian mess of programming languages and make
it a law that from now on, you are only allowed to program in,
say, Java.  We'd hate that and complain, but would be arrested
and put into concentration camps until we either learn and
publicly announce how great Java actually is, and how sorry we
are for not recognizing that earlier, or are given the coup de
grace if too stubborn.

What would happen after a few decades or so?  I tell you what:
People would be happy.  They'd laugh about us crazy freaks who
were too stupid to recognize the merits of their progressive,
modern ideas.  Nobody would remember any of the old languages,
and how would young people know that they were worth anything, if
every history book tells them that they were stupid and
anti-modern?  Everybody can see that they can write everything in
Java, so why should they miss anything?

Regards,
-- 
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F
From: Julian Stecklina
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <877knwuzfq.fsf@blitz.comp.com>
Nils Goesche <···@cartan.de> writes:

> Torsten <·····@fraqz.archeron.dk> writes:
> 
> > Nils Goesche <···@cartan.de> skrev:
> > > But I have /read/ texts that didn't use capitalization in
> > > German, and it was very annoying in that it just makes harder
> > > to guess how a sentence is likely to end, something that is
> > > very important in German (the verb at the end...). [...] This
> > > has been measured, BTW.

[...]

Hmm... I have been speaking and writing German for quite a while now,
and my knowledge of grammar says that the predicate comes in second
position in a main clause. Only its non-finite part goes to the end.

> What would happen after a few decades or so?  I tell you what:
> People would be happy.  They'd laugh about us crazy freaks who
> were too stupid to recognize the merits of their progressive,
> modern ideas.  Nobody would remember any of the old languages,
> and how would young people know that they were worth anything, if
> every history book tells them that they were stupid and
> anti-modern?  Everybody can see that they can write everything in
> Java, so why should they miss anything?

That reminds me of Paul Graham saying that when he was using BASIC,
which at the time did not support recursion, he never needed
recursion, as he did not know that it existed.
And writing a long sentence in English reminds me how I love commas in
German to make reading easier. ;)

Regards,
Julian
-- 
My homepage: http://julian.re6.de

I am looking for a PCMCIA v1.x type I/II/III network card.
In exchange I am offering a v2 100MBit card, new in its original packaging.
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87wuvwibqo.fsf@darkstar.cartan>
Julian Stecklina <··········@web.de> writes:

> Nils Goesche <···@cartan.de> writes:
> 
> > Torsten <·····@fraqz.archeron.dk> writes:
> > 
> > > Nils Goesche <···@cartan.de> skrev:
> > > > But I have /read/ texts that didn't use capitalization in
> > > > German, and it was very annoying in that it just makes harder
> > > > to guess how a sentence is likely to end, something that is
> > > > very important in German (the verb at the end...). [...] This
> > > > has been measured, BTW.
> 
> hmm... I have been speaking and writing in German quite a while now
> and my knowledge of grammar says that the predicate is at the second
> place in the main clause. Only its infinite part goes to the
> end.

Anscheinend will er mich einfach nicht verstehen.  Hat er
wirklich kein Beispiel, kein einziges Beispiel, nicht einmal nach
langem Sinnen ueber mein wundervolles Posting, dafuer gefunden?
[Roughly: "Apparently he simply does not want to understand me.  Has he
really found no example for it, not a single example, not even after
long pondering over my wonderful posting?"]
(Yes, that's what I meant)

> > What would happen after a few decades or so?  I tell you what:
> > People would be happy.  They'd laugh about us crazy freaks who
> > were too stupid to recognize the merits of their progressive,
> > modern ideas.  Nobody would remember any of the old languages,
> > and how would young people know that they were worth anything, if
> > every history book tells them that they were stupid and
> > anti-modern?  Everybody can see that they can write everything in
> > Java, so why should they miss anything?
> 
> That reminds me of Paul Graham saying that when he was using BASIC
> which at his time did not support recursion, he never needed
> recursion, as he did not know that it existed.

Like that.

> And writing a long sentence in English reminds me how I love commas in
> German to make reading easier. ;)

Me too, hehe :-)

Regards,
-- 
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a81d1j$25jq$1@news.cybercity.dk>
Nils Goesche <···@cartan.de> skrev:

> Well, duh.  Giving up the extra effort of capitalization isn't
> exactly something you have to train anybody for.  I have read
> several *books* in German that didn't use capitalization at all.
> It was horrible.

I am talking about the measurement results. You claimed that the
capitalization makes reading easier. But what was measured? The
ability to read something that differs from the conventional way
of writing German in comparison to the way everybody and his dog
learned in school. Somehow the result isn't all that surprising.
Where was the control group? The people who had grown up using
only non-capitalized German. How do you think they would fare in
such a test if they existed? They would have had no preconceived
ideas about non-capitalization looking weird.

> I don't have much time in the morning, but still manage to read
> large parts of the Frankfurter Allgemeine every morning, in
> very little time. I sometimes ``observe'' myself how I read
> that fast, and I found out that when looking at a whole block
> of text at a time, the visual structure of sentences indicated
> by capitalized words is a very useful help for the reader.

Most likely because that's what you are used to.

> I don't know Danish. [...] You said there was a controversy
> about it; maybe the people who were against it were right?

The most vocal opponents were the kind of people who always think
the world is coming to an end if anything changes. They are now
spending their energy on how to, or not to, place commas. In
general, they are surprisingly clueless about the subjects they
make sarcastic remarks about.

-- 
Torsten
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <fo3cyjfm6j.fsf@trex10.cs.bell-labs.com>
Torsten <·····@fraqz.archeron.dk> writes:

> Nils Goesche <···@cartan.de> skrev:
> 
> > Well, duh.  Giving up the extra effort of capitalization isn't
> > exactly something you have to train anybody for.  I have read
> > several *books* in German that didn't use capitalization at all.
> > It was horrible.
> 
> I am talking about the measurement results. You claimed that the
> capitalization makes reading easier. But what was measured? The
> ability to read something that differs from the conventional way
> of writing German in comparison to the way everybody and his dog
> learned in school. Somehow the result isn't all that surprising.
> Where was the control group? The people who had grown up using
> only non-capitalized German. How do you think they would fare in
> such a test if they existed? They would have had no preconceived
> ideas about non-capitalization looking weird.

I can give you an example from a learner of Japanese: The use of Kanji
(Chinese characters) in place of their hiragana (phonetic)
counterparts is just as redundant as capitalization is in German.  A
literate speaker of Japanese can easily read text that is written
entirely in hiragana (although it might bother him or her a bit).  In
fact, historically, there was a time when women were not allowed to
write Kanji, so this had to be true more or less by definition.

There are on the order of 50 hiragana, but there are several thousands
of Kanji -- which means that learning just hiragana is immensely easier
than learning both.  According to the above, one would expect that
someone without prior exposure to either system would have an easier
time reading pure hiragana text.

I, having not been raised in Japan, fall into this category of having
no prior exposure.  But what can I tell you?  The moment I managed to
memorize even just a tiny number of Kanji, sentences that actually
used them (in place of their hiragana spellings) became *vastly*
easier to read for me.  I am not a psychologist or linguist, so I
won't speculate on why that is.

So if it were true that either way would be equally easy to read for
someone without prior training, why would an utterly untrained person
such as I (and pretty much all of my fellow students as well, BTW) see
this effect?  In other words, there is certainly more going on than
just a "trained dog effect".

> The most vocal opponents were the kind of people who always think
> the world is coming to an end if anything changes. They are now
> spending their energy on how to, or not to, place commas. In
> general, they are surprisingly clueless about the subjects they
> make sarcastic remarks about.

[ ... which brings us back to the topic of "ad hominems".
  Just because idiots or bigots defend something, that something does
  not have to be wrong.  (It, of course, does not mean that it is
  right either.) ]

Matthias
From: Dorai Sitaram
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a822bj$o2s$1@news.gte.com>
In article <··············@trex10.cs.bell-labs.com>,
Matthias Blume  <········@shimizu-blume.com> wrote:
>
>There are on the order of 50 hiragana, but there are several thousands
>of Kanji -- which means that learning just hiragana is immensely easier
>than learning both.  According to the above, one would expect that
>someone without prior exposure to either system would have an easier
>time reading pure hiragana text.
>
>I, having not been raised in Japan, fall into this category of having
>no prior exposure.  But what can I tell you?  The moment I managed to
>memorize even just a tiny number of Kanji, sentences that actually
>used them (in place of their hiragana spellings) became *vastly*
>easier to read for me.  I am not a psychologist or linguist, so I
>won't speculate on why that is.
>
>So if it were true that either way would be equally easy to read for
>someone without prior training, why would an utterly untrained person
>such as I (and pretty much all of my fellow students as well, BTW) see
>this effect?  In other words, there is certainly more going on than
>just a "trained dog effect".


Do kanji perhaps serve as some sort of abbreviation,
or, I should rather say, syntactic abstraction?  If so,
their appeal may have the same reason as why
no-longer-newbie users of a programming language prefer
to extend the language with (their own or
others') procedural and textual abstractions rather
than sticking to core procedures and core syntax.  I'm
speculating only.  

(BTW, abbreviations I should note are not just for
saving space or "typing".  They actually aid
comprehension by reducing the time taken for cliches
that don't deserve that time, and correspondingly
letting the non-cliche part of the communication be
highlighted more.  Even electronic communication,
where space is not expensive the same way as on paper,
and where "completion" aids abound, profits from
abbreviations.)

--d
From: Brian Spilsbury
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <f0f9d928.0203291102.55a722c4@posting.google.com>
····@goldshoe.gte.com (Dorai Sitaram) wrote in message news:<············@news.gte.com>...
> In article <··············@trex10.cs.bell-labs.com>,
> Matthias Blume  <········@shimizu-blume.com> wrote:
> >
> >So if it were true that either way would be equally easy to read for
> >someone without prior training, why would an utterly untrained person
> >such as I (and pretty much all of my fellow students as well, BTW) see
> >this effect?  In other words, there is certainly more going on than
> >just a "trained dog effect".
> 
> 
> Do kanji perhaps serve as some sort of abbreviation,
> or, I should rather say, syntactic abstraction?  If so,
> their appeal may have the same reason as why
> no-longer-newbie users of a programming language prefer
> to extend the language with (their own or
> others') procedural and textual abstractions rather
> than sticking to core procedures and core syntax.  I'm
> speculating only.  

No, kanji are not used for syntax; syntax in Japanese is mediated via
hiragana/katakana modifiers and particles (also represented in
hiragana, although 'wo' is distinct, and the particle 'ha' is
pronounced 'wa').

Kanji give direct semantic forms. These semantic forms are distinct
from any pronunciation, and the pronunciation of a particular
sequence of kanji is determined by the phonetic modifiers trailing it,
and/or the combination of a sequence of kanji through the On
(Chinese-derived) or Kun (Japanese) readings, which are not mixed in a
given sequence [a bit like composing Latin with Latin, and Greek with
Greek].

You might think of kanji as giving a particular root form.

[watakushi]-ha [ni][hon][go]-wo [ben][kyou] shite-imashita.
[I]-topic [japanese][language]-object [study] doing-was.

[boku]-ga basu-de [kou][kou]-e [i](ki)mashita.
[I]-subject bus-by [highschool]-to [go](phonetic kanji
modifier)-past-tense.

The [bracketed] forms are kanji in these examples.

My Japanese is a bit rusty, so please excuse any errors.

As a side note, a study found that quite different areas of the brain
are used to process the hiragana/katakana forms and the kanji forms,
and a different study found that reasonably severely dyslexic American
(English-speaking) children were able to learn several hundred Chinese
characters without undue difficulty, although they were unable to read
Roman characters.

As a final note, a certain Jesuit missionary declared that the
Japanese written language was designed by the devil, and I think that
anyone who is familiar with it would be inclined to agree. ^^
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3u1qz9jw0.fsf@hanabi.research.bell-labs.com>
Just in case you care what Brian's examples look like when rendered
in actual Kanji and hiragana (requires sufficient MIME support in
your newsreader):

> [watakushi]-ha [ni][hon][go]-wo [ben][kyou] shite-imashita.
> [I]-topic [japanese][language]-object [study] doing-was.

pure hiragana: わたくしにほんごをべんきょうしていました。
with kanji:    私は日本語を勉強していました。

> [boku]-ga basu-de [kou][kou]-e [i](ki)mashita.
> [I]-subject bus-by [highschool]-to [go](phonetic kanji
> modifier)-past-tense.

pure hiragana: ぼくがばすでこうこうへいきました。
with kanji:    僕がバスで高校へ行きました。

マティアス
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3k7rvt77s.fsf@hanabi.research.bell-labs.com>
Matthias Blume <········@shimizu-blume.com> writes:

> Just in case you care what Brian's examples looks like when rendered
> in actual Kanji and hiragana (requires sufficient MIME support in
> you newsreader):
> 
> > [watakushi]-ha [ni][hon][go]-wo [ben][kyou] shite-imashita.
> > [I]-topic [japanese][language]-object [study] doing-was.
> 
> pure hiragana: わたくしにほんごをべんきょうしていました。

Oops, typo.  Should have been:

                 わたくしはにほんごをべんきょうしていました。

> > [boku]-ga basu-de [kou][kou]-e [i](ki)mashita.
> > [I]-subject bus-by [highschool]-to [go](phonetic kanji
> > modifier)-past-tense.
> 
> pure hiragana: ぼくがばすでこうこうへいきました。
> with kanji:    僕がバスで高校へ行きました。

By the way, notice how "bus" (basu) is spelled in _katakana_.

Anyway, this is perhaps getting a bit off-topic... :-)

Matthias
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87ofh7z6ua.fsf@darkstar.cartan>
····@goldshoe.gte.com (Dorai Sitaram) writes:

> In article <··············@trex10.cs.bell-labs.com>,
> Matthias Blume  <········@shimizu-blume.com> wrote:
> >
> >There are on the order of 50 hiragana, but there are several thousands
> >of Kanji -- which means that learning just hiragana is immensely easier
> >than learning both.  According to the above, one would expect that
> >someone without prior exposure to either system would have an easier
> >time reading pure hiragana text.
> >
> >I, having not been raised in Japan, fall into this category of having
> >no prior exposure.  But what can I tell you?  The moment I managed to
> >memorize even just a tiny number of Kanji, sentences that actually
> >used them (in place of their hiragana spellings) became *vastly*
> >easier to read for me.  I am not a psychologist or linguist, so I
> >won't speculate on why that is.
> 
> Do kanji perhaps serve as some sort of abbreviation,
> or, I should rather say, syntactic abstraction?  If so,
> their appeal may have the same reason as why
> no-longer-newbie users of a programming language prefer
> to extend the language with (their own or
> others') procedural and textual abstractions rather
> than sticking to core procedures and core syntax.  I'm
> speculating only.  

Kanji often directly denote a certain meaning of a word;
they're like images.  When I ask my wife about the meaning of a
Japanese word I'd heard or read (in Latin characters) somewhere,
she is usually helpless: She can't tell until she sees the kanji
sign.  Hiragana only describes the sound of a word, like our
Latin characters.  She told me that sometimes Japanese would
actually draw a kanji sign in the air when talking, to indicate
what meaning of a word they're saying is intended.  For instance,
there is a kanji sign that has one and only one meaning: Kant's
notion of ``category'' :-)

Regards,
-- 
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F
From: Takehiko Abe
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <keke-3003020217280001@solg4.keke.org>
In article <··············@trex10.cs.bell-labs.com>, Matthias Blume <········@shimizu-blume.com> wrote:

> [...] The moment I managed to
> memorize even just a tiny number of Kanji, sentences that actually
> used them (in place of their hiragana spellings) became *vastly*
> easier to read for me.  I am not a psychologist or linguist, so I
> won't speculate on why that is.

_One_ reason is that Japanese does not use white spaces to delimit
the words. So, all-hiragana text feels like reading

   MakeLoadFormSavingSlots

instead of

   make-load-form-saving-slots

-- 
<keke at mac com>
Are you sure that sound might want to have an idiot?
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <foy9gbdylm.fsf@trex10.cs.bell-labs.com>
····@ma.ccom (Takehiko Abe) writes:

> In article <··············@trex10.cs.bell-labs.com>, Matthias Blume <········@shimizu-blume.com> wrote:
> 
> > [...] The moment I managed to
> > memorize even just a tiny number of Kanji, sentences that actually
> > used them (in place of their hiragana spellings) became *vastly*
> > easier to read for me.  I am not a psychologist or linguist, so I
> > won't speculate on why that is.
> 
> _One_ reason is that Japanese does not use white spaces to delimit
> the words.

Right.

> So, all hiragana text will be felt like reading
> 
>    MakeLoadFormSavingSlots

Wouldn't it be more like

     makeloadformsavingslots

?

Matthias
From: Hartmann Schaffer
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3ca546b8@news.sentex.net>
In article <··············@trex10.cs.bell-labs.com>,
	Matthias Blume <········@shimizu-blume.com> writes:
> ...
> There are on the order of 50 hiragana, but there are several thousands
> of Kanji -- which means that learing just hiragana is immensely easier
> than learning both.  According to the above, one would expect that
> someone without prior exposure to either system would have an easier
> time reading pure hiragana text.
> 
> I, having not been raised in Japan, fall into this category of having
> no prior exposure.  But what can I tell you?  The moment I managed to
> memorize even just a tiny number of Kanji, sentences that actually
> used them (in place of their hiragana spellings) became *vastly*
> easier to read for me.  I am not a psychologist or linguist, so I
> won't speculate on why that is.

does japanese have many homophones?  from what i remember having read a
while ago, the kanji characters quite often are taken over from
chinese to designate the japanese word for the chinese word the
character was developed for.  if the language is rich in homophones,
this would help distinguish between identically sounding words with
totally different meanings

hs

-- 

don't use malice as an explanation when stupidity suffices
From: Hartmann Schaffer
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3ca544d1@news.sentex.net>
In article <·············@news.cybercity.dk>,
	Torsten <·····@fraqz.archeron.dk> writes:
> ...
>> I don't have much time in the morning, but still manage to read
>> large parts of the Frankfurter Allgemeine every morning, in
>> very little time. I sometimes ``observe'' myself how I read
>> that fast, and I found out that when looking at a whole block
>> of text at a time, the visual structure of sentences indicated
>> by capitalized words is a very useful help for the reader.
> 
> Most likely because that's what you are used to.
> 
>> I don't know Danish. [...] You said there was a controversy
>> about it; maybe the people who were against it were right?
> 
> The most vocal opponents were the kind of people who always think
> the world is coming to an end if anything changes. They are now
> spending their energy on how to, or not to, place commas. In
> general, they are surprisingly clueless about the subjects they
> make sarcastic remarks about.

i remember a feuilleton article/editorial in the frankfurter
allgemeine quite a while ago (around 1970).  it was during one of the
ever-returning discussions in the german language area about the
capitalisation rules and their reform.  in the article, the
author suggested that they should not be simplified, since some of the
more obscure rules would help distinguish between better and less well
educated persons.  iirc, he didn't put forward his position tongue in
cheek

hs

-- 

don't use malice as an explanation when stupidity suffices
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a82vqn$18jd$1@news.cybercity.dk>
Nils Goesche <···@cartan.de> wrote:

> I don't know Danish.  Maybe it doesn't make a difference there.
> Maybe Danes only used to capitalize nouns because the Germans
> did, and as the Germans weren't exactly very popular in 1948,
> that might have been a good opportunity to give up on it.

The capitalization tradition had the same origin as the German
one. It started as a fad among printers back when printing was a
relatively new craft. It is also true that the lack of popularity
of anything German in the late forties made it possible to
pass the bill that put the spelling reform into effect, but it
wasn't the reason for doing it; it just provided the necessary
leverage in the general public. The idea goes back to the 19th
century, but was for a long time met with scorn, partly due to
inertia and conservatism and partly due to skepticism. Why change
something that has worked well for centuries? Well, it hadn't
worked well. Many people couldn't figure out which words should
be capitalized. That was the real reason for the reform. The
main arguments against it invariably consisted of examples, like
the ones you gave in German, where the capitalization eliminates
ambiguity. In isolation, that was - and is - correct, but what
the critics overlooked is that sentences don't occur in total
isolation. They are always part of a larger conversational
context, a discourse, and that is what implicitly removes the
ambiguity.

> Or maybe it wasn't. Who is supposed to know anymore? You said
> there was a controversy about it; maybe the people who were
> against it were right? Who would remember? How could you tell?

The discussions ended decades ago when people realized that
nothing really had been lost and the new system made it easier
for people with a more modest formal knowledge of grammar to
write in a way that reasonably conforms to the official norm. A
net gain for everybody. And, even today, there are still quite a
few people around who originally learned the old system.

I think we should end this subthread now as it has nothing to do
with Lisp anymore.

-- 
Torsten
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87eli2c308.fsf@darkstar.cartan>
Torsten <·····@fraqz.archeron.dk> writes:

> I think we should end this subthread now as it has nothing to do
> with Lisp anymore.

So be it.  I disagree of course, but so what.  I've made my
point, you've made your point, everything is clear.

Regards,
-- 
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a8410o$2pqe$1@news.cybercity.dk>
Nils Goesche <···@cartan.de> wrote:

> So be it.  I disagree of course, but so what.  I've made my
> point, you've made your point, everything is clear.

Yup :)

Have fun,
-- 
Torsten
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226099447876195@naggum.net>
* Matthias Blume
| Sorry, I was unreasonably harsh on you, Kent.

  You are a clever little asshole, aren't you?

| By the way, here is an example in a case-sensitive natural language where
| the distinction between uppercase and lowercase gets *pronounced*: "mit"
| vs. "MIT" in German.  The first means "with" and is pronounced like
| "mitt", the second is the Massachussetts Institute of Technology and is
| pronounced like speakers of English would pronounce it: em-ay-tee.

  Geez, dude, you are _so_ full of yourself.  No wonder you think this is
  supremely silly -- your own contributions are ludicrous and stupid.

  Whether the M, I, and T of the words that make up "MIT" are capitalized
  or not is incidental.  That one chooses to uppercase initials of words
  is precisely what I am talking about.  Sheesh, some people.

| I think that there are enough examples of this around so that making a
| distinction between uppercase and lowercase is warranted in the natural
| language case.

  Hello?  Of course there is a _distinction_, you incredibly retarded jerk!
  Have you been arguing for a _distinction_?  Man, how can you survive
  being so goddamn _stupid_?  Nobody has argued against a distinction, you
  insufferably arrogant moron.  The point is how it should be REPRESENTED!
  (Incidental capitalization added purely for effect.)  Is it even possible
  to be so unintelligent that this is not something you could have avoided
  by _thinking_ a little?  Of course, you are in this "you guys are silly"
  mode, so thinking on your own is out of the question, but the whole point
  is that you are so unconscious and so unwilling to engage your brain to
  understand what somebody else argues that you effectively reduce the
  discussion to your pathetically ignorant level.  Of _course_ there is a
  distinction!  Geez, you are such an idiot.  The question is: should that
  visible distinction have been coded to represent the incidental quality
  apart from the intrinsic quality, and the answer is so "advanced" that
  your puny little brain will in all likelihood not grasp its simplicity.

  Let me give your severely reduced mental capacity a simple enough example
  that you might actually be inspired to think about the ramifications.
  The symbol for Ångström in Unicode is exactly the same as the glyph for
  the letter A with ring above, because the guy's name was spelled with
  that letter, just like Celsius and Fahrenheit, but all these three
  letters should never be lowercased even though they are upper-case
  letters.  This is an intrinsic quality.  For this reason, Unicode has
  chosen to represent them as _symbols_, not letters.  The same applies to
  Greek omega, pi, rho, and sigma, which are different symbols in each
  case.  Can you wrap your exceptionally pitiful brain around these few and
  simple examples to perhaps grasp that incidental qualities and intrinsic
  qualities are important?  Or are you so unphilosophical and such a
  leering idiot with a moronic grin permanently attached to his skull that
  being able to grasp what other people have thought about before you has
  become impossible for you?

  No wonder you think those who think are _gods_ in their own mind: If you
  had been able to think at all, you would probably experience _several_
  revelations of such magnitude that one "god" would not be enough.

| Again, I do not think that this needs to be in any way correlated with
| the PL case.

  Is the stuff you are smoking legal?  Go back to your Scheme community,
  where being supremely silly is not considered rude to your compatriots.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226098099655467@naggum.net>
* Matthias Blume
| The two words are pronounced very differently.

  But so is house and house, distinguished by a voiced and unvoiced s.
  Some languages also have tonemes, not just phonemes.  Norwegian is among
  them.  The phonemes of the Norwegian words for "farmers", "prayers" and
  "beans" are the same, but the tonemes differ.  Immigrants often have
  farmers for dinner and purchase produce directly from beans as a result.
  The word for "farmers" is spelled "b�nder" but "beans" and "prayers" are
  both spelled "b�nner".  Note that this is not a question of stress.  All
  three stress the first syllable exactly the same, and do not stress the
  final syllable.

| Anyway, this whole debate is supremely silly, IMHO.

  Then you are supremely silly who continue to post your drivel to it.

| Fortunately neither you nor Erik get to dictate the rules, at least not
| for those languages that I speak or program in...

  Of course, you are a Scheme freak and a tourist in comp.lang.lisp, the
  very canonicalization of the irresponsible trouble-maker who thinks he is
  an outsider to the community he torments with "you are silly who do it
  differently from me" attitudes.  Thank you for contributing to the
  _impression_ that Scheme is the language of choice of deranged lunatics.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87zo0wqd48.fsf@becket.becket.net>
Erik Naggum <····@naggum.net> writes:

>   Some languages also have tonemes, not just phonemes.  Norwegian is among
>   them.  The phonemes of the Norwegian words for "farmers", "prayers" and
>   "beans" are the same, but the tonemes differ.  Immigrants often have
>   farmers for dinner and purchase produce directly from beans as a result.
>   The word for "farmers" is spelled "b�nder" but "beans" and "prayers" are
>   both spelled "b�nner".  Note that this is not a question of stress.  All
>   three stress the first syllable exactly the same, and do not stress the
>   final syllable.

Huh?  If they are different words, then *by the definition of a
phoneme* the sound which distinguishes them is a phoneme.  What is a
"toneme"? 
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226104706223880@naggum.net>
* Thomas Bushnell, BSG
| Huh?  If they are different words, then *by the definition of a phoneme*
| the sound which distinguishes them is a phoneme.  What is a "toneme"?

  Stress is generally not considered to be a difference in phoneme.

  The sound is exactly the same, but whether you have entering, departing,
  rising, falling, high, low, up-down, down-up, or level tone can and does
  change the meaning of the word.  Thai, for instance, has explicit tone
  markers.  Chinese has different ideographs for words that are pronounced
  with the same phonemes and different tonemes.

  Consider the phonemes of the word "really".  The toneme is the difference
  in pronunciation between "Really?" and "Really." and "Really!".

  French, for instance, has no stress, but tends to use marginally shorter
  and longer vowels.  They also have no tonemes, so the French have very
  _serious_ problems dealing with other languages and sound ridiculous in
  almost every other language than their own.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <874rj4uey3.fsf@becket.becket.net>
Erik Naggum <····@naggum.net> writes:

> * Thomas Bushnell, BSG
> | Huh?  If they are different words, then *by the definition of a phoneme*
> | the sound which distinguishes them is a phoneme.  What is a "toneme"?
> 
>   Stress is generally not considered to be a difference in phoneme.

Oh, ok.  That's a good point; the term "phoneme" is ambiguous I think.
Tonal differences are sometimes phonemic and sometimes not, but I now
understand what you mean.  Whether a tonal or length difference should
be officially phonemic is a matter of style and not any real linguistics,
as far as I can tell.

>   Consider the phonemes of the word "really".  The toneme is the difference
>   in pronunciation between "Really?" and "Really." and "Really!".

Yeah, but there it's a matter of marking, which is different than
tone.  A better example in English is between homographs like
"conduct" (a noun, stress on the first syllable) and "conduct" (a
verb, stress on the second syllable).  

Because stress is contextual, it's not normally counted as a phoneme.
Tone and length are not contextual, so I think those are usually
counted as phonemes.  But (as I said above) I think this is a pretty
gray area.

>   French, for instance, has no stress, but tends to use maringally shorter
>   and longer vowels.  They also have no tonemes, so they French have very
>   _serious_ problems dealing with other languages and sound ridiculous in
>   almost every other language than their own.

Actually French does have stress as a word marker; the last syllable
of each word gets a stress.  (Obviously, stress is therefore not
phonemic in French.)

Thomas
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226112482576869@naggum.net>
* Thomas Bushnell, BSG
| Oh, ok.  That's a good point; the term "phoneme" is ambiguous I think.
| Tonal differences are sometimes phonemic and sometimes not, but I now
| understand what you mean.  Whether a tonal or length difference should be
| officially phonemic is a matter style and not any real linguistics, as
| far as I can tell.

  *sigh*  My native language has tonemes.  Yours does not.  Trust me on
  this, OK?  Go look it up if you doubt me.

  Tone is the musical tone with which you pronounce a phoneme, or more
  precisely, with the relative direction of the change of the tone
  throughout the word.

> Consider the phonemes of the word "really".  The toneme is the difference
> in pronunciation between "Really?" and "Really." and "Really!".

| Yeah, but there it's a matter of marking, which is different than tone.

  *sigh*  No, this is a tone difference.  The rising tone at the end of a
  question is precisely this -- tone.  One does not usually talk about
  tonemes when dealing with the changing meaning of a sentence, but it is
  the same idea.

| A better example in English is between homographs like "conduct" (a noun,
| stress on the first syllable) and "conduct" (a verb, stress on the second
| syllable).

  No, that would be stress, not tone.  I was trying to give you an example
  of what tone is, not how the same sequence of phonemes can have different
  meaning in differing ways.

| Because stress is contextual, it's not normally counted as a phoneme.
| Tone and length are not contextual, so I think those are usually counted
| as phonemes.  But (as I said above) I think this is a pretty gray area.

  No, it is not a grey area.  It just does not apply to English.  Study
  Norwegian or Thai.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87wuvzpwzy.fsf@becket.becket.net>
Erik Naggum <····@naggum.net> writes:

> * Thomas Bushnell, BSG
> | Oh, ok.  That's a good point; the term "phoneme" is ambiguous I think.
> | Tonal differences are sometimes phonemic and sometimes not, but I now
> | understand what you mean.  Whether a tonal or length difference should be
> | officially phonemic is a matter of style and not any real linguistics, as
> | far as I can tell.
> 
>   *sigh*  My native language has tonemes.  Yours does not.  Trust me on
>   this, OK?  Go look it up if you doubt me.

I'm trusting you about the way Norwegian works, and I'm trying to
understand it in the terminology used in English to speak about
linguistics.

I do understand perfectly well what tone is.

> | Because stress is contextual, it's not normally counted as a phoneme.
> | Tone and length are not contextual, so I think those are usually counted
> | as phonemes.  But (as I said above) I think this is a pretty gray area.
> 
>   No, it is not a grey area.  It just does not apply to English.  Study
>   Norwegian or Thai.

I know perfectly well what tone is.

The question is whether tonal difference is a phonemic difference.

Since a phoneme is a minimal unit distinguishing two words, if there
are two words that differ only in tone, the difference must therefore
be phonemic.

I mentioned stress (in English, with the "conduct" example), because
stress is also sometimes thought not to distinguish phonemes, but
really it does.

What is a gray area is how rigid one wants to be about the
definition of "phoneme".

Thomas
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226123116550822@naggum.net>
* Thomas Bushnell, BSG
| Since a phoneme is a minimal unit distinguishing two words, if there are
| two words that differ only in tone, the difference must therefore be
| phonemic.

  Apparently, this is how some people see it -- I have not seen a
  difference in tone referred to as "phonemic".  However, phonemes are
  supposed to be discrete elements of speech.  A toneme is not -- the change
  in tone usually spans several phonemes.  Therefore, it is either a
  phoneme of its own, which seems odd, or an additional speech element.
  If a "phoneme" is the _only_ smallest unit of sound, it no longer
  appears possible to enumerate the phonemes of a language.

| I mentioned stress (in English, with the "conduct" example), because
| stress is also sometimes thought not to distinguish phonemes, but
| really it does.

  So when something, anything distinguishes phonemes, they become two?
  That does not appear to be useful.  It seems rather to multiply them
  without bounds.

| What is a gray area is how rigid one wants to be about the
| definition of "phoneme".

  Seems if you can put whatever you want into it, it is rendered useless.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87wuvzcjhj.fsf@becket.becket.net>
Erik Naggum <····@naggum.net> writes:

> | What is a gray area is how rigid one wants to be about the
> | definition of "phoneme".
> 
>   Seems if you can put whatever you want into it, it is rendered useless.

That's one-bit thinking.  It's a gray area, not a rigid definition,
and I thank you for pointing out the complexities in the case of
Norwegian.
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226154911310660@naggum.net>
* Thomas Bushnell, BSG
| It's a gray area, not a rigid definition, and I thank you for pointing
| out the complexities in the case of Norwegian.

  Bare hyggelig!

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Russell Senior
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <86eli7ot2k.fsf@coulee.tdb.com>
>>>>> "Erik" == Erik Naggum <····@naggum.net> writes:

TB> It's a gray area, not a rigid definition, and I thank you for
TB> pointing | out the complexities in the case of Norwegian.

Erik>   Bare hyggelig!

Erik, I have _no_ interest in seeing your bare "hyggelig",
particularly if it is a "gray area".  You people disgust me! ;-)


-- 
Russell Senior         ``The two chiefs turned to each other.        
·······@aracnet.com      Bellison uncorked a flood of horrible       
                         profanity, which, translated meant, `This is
                         extremely unusual.' ''                      
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226238165159557@naggum.net>
* Erik Naggum
| Bare hyggelig!

* Russell Senior
| Erik, I have _no_ interest in seeing your bare "hyggelig",
| particularly if it is a "gray area".  You people disgust me! ;-)

  Ah, at last an explanation for why so many foreigners think we kind and
  gentle Norwegians are so rude.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Larry Clapp
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <4l0r7a.ui1.ln@rabbit.ddts.net>
In article <··············@becket.becket.net>, Thomas Bushnell, BSG wrote:
> Since a phoneme is a minimal unit distinguishing two words, if there are two
> words that differ only in tone, the difference must therefore be phonemic.

Could one classify a toneme as a subclass of phonemes?  More to the point, do
linguists?

-- L
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87hen2dckr.fsf@becket.becket.net>
Larry Clapp <·····@theclapp.org> writes:

> In article <··············@becket.becket.net>, Thomas Bushnell, BSG wrote:
> > Since a phoneme is a minimal unit distinguishing two words, if there are two
> > words that differ only in tone, the difference must therefore be phonemic.
> 
> Could one classify a toneme as a subclass of phonemes?  More to the
> point, do linguists?

I don't know; the word "toneme" isn't in any dictionaries I had ready
access to when I checked.  The text I learned what linguistics I know
from only has phonemes, and mentions tonal differences as one kind of
phonemic separation.

Thomas
From: Rahul Jain
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <878z8e3e9d.fsf@photino.sid.rice.edu>
Larry Clapp <·····@theclapp.org> writes:

> Could one classify a toneme as a subclass of phonemes?  More to the point, do
> linguists?

I think the main problem is that a toneme can span multiple phonemes,
so a toneme cannot necessarily be described as a type of phoneme.

-- 
-> -/                        - Rahul Jain -                        \- <-
-> -\  http://linux.rice.edu/~rahul -=-  ············@techie.com   /- <-
-> -/ "Structure is nothing if it is all you got. Skeletons spook  \- <-
-> -\  people if [they] try to walk around on their own. I really  /- <-
-> -/  wonder why XML does not." -- Erik Naggum, comp.lang.lisp    \- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
   (c)1996-2002, All rights reserved. Disclaimer available upon request.
From: Thomas Bushnell, BSG
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87bsdablwq.fsf@becket.becket.net>
Rahul Jain <·····@sid-1129.sid.rice.edu> writes:

> Larry Clapp <·····@theclapp.org> writes:
> 
> > Could one classify a toneme as a subclass of phonemes?  More to
> > the point, do linguists?
> 
> I think the main problem is that a toneme can span multiple phonemes,
> so a toneme cannot necessarily be described as a type of phoneme.

Yeah, I think this is central to the problem.

Does a toneme in Norwegian extend past a single syllable, however?  I
don't know the answer to that question.

In Classical Attic Greek, the accents marked tone as well as stress,
and were (mostly) phonemic.  But the accents were always marked over a
single vowel, and so you could distinguish not only the long eta and
the short epsilon, but each could have one of three different accents;
all distinctions which are basically not phonemic in English (though
we do have all those sounds).

The tones actually extend beyond just the vowel, and affect timing and
intonation of the whole word, however.  But they are assigned to the
stressed vowel only, and are counted as various phonemic variants of
that vowel.

The situation might work out similarly in Norwegian, dunno.

Thomas
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226238950860301@naggum.net>
* Thomas Bushnell, BSG
| Does a toneme in Norwegian extend past a single syllable, however?  I
| don't know the answer to that question.

  Basically, the whole word is either rising, falling, or rising-falling,
  and in combining words, the intonation of both words changes.  For
  instance, English as a Second Language means something different from
  English as the Second Language.  In Norwegian we have "Norsk som andre
  sprog" and "Norsk som andresprog", where the former means either "like
  other languages" or "as the second language" depending on tone, and is
  also distinguished from the latter by tone, not by stress.  This is
  particularly funny when those furriners try to find the section in the
  bookstore that would help them get just this point, and doubly funny when
  the bookstore cannot even spell it correctly -- a mistake which their all
  too young information desk attendant could not pick up from the tone
  difference, even though several bystanders could, and laughed, when I
  tried in vain to point out the funny mistake to her.

| The tones actually extend beyond just the vowel, and affect timing and
| intonation of the whole word, however.  But they are assigned to the
| stressed vowel only, and are counted as various phonemic variants of that
| vowel.
| 
| The situation might work out similarly in Norwegian, dunno.

  I do not know Classical Attic Greek so I cannot say for certain, but your
  brief description makes me believe there is a good chance of a similarity.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Florian Hars
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <slrnaa0a5h.46l.florian@prony.bik-gmbh.de>
Erik Naggum schrieb im Artikel <················@naggum.net>:
> * Thomas Bushnell, BSG
>| Tonal differences are sometimes phonemic and sometimes not
> 
>   *sigh*  My native language has tonemes.  Yours does not.  Trust me on
>   this, OK?  Go look it up if you doubt me.

Some data points on "toneme" from the web:
The American Heritage® Dictionary:
  A type of phoneme
The Concise Oxford Dictionary of Linguistics:
  A unit of pitch, especially in tone languages, treated as or
  analogously to a phoneme.
http://www.factmonster.com:
  a phoneme consisting of a contrastive feature of tone in a tone
  language

Yours, Florian.
From: Thien-Thi Nguyen
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <kk9zo0usx4l.fsf@ttn1.best.vwh.net>
Erik Naggum <····@naggum.net> writes:

>   No, it is not a grey area.  It just does not apply to English.  Study
>   Norwegian or Thai.

vietnamese is a good example to study for those not familiar w/ this kind of
language feature because the representation of the tones is explicit (in the
accents).

word play in vietnamese often involves varying these tones.

         ~  /  /  ^  \
        ca ca co co ca      (approximately, some markings omitted!)
                  .

(each fish has an uncle tomato.)  the ~ means make your voice kind of swirly,
the / means make it go higher, \ means make it go lower and . under the vowel
means make your voice go really low (there is also ? which makes your voice go
higher in a sort of question-like way).  the name of each accent uses the
accent, which gives some insight on how spelling is taught: first you say the
unemphasized constituent phonemes then you say the accent; repeat, eliding the
naming of the accent by using it.  see sesame street for (weird to me) english
adaptation...

this representation was introduced by the french during colonial times and
reflects some french cultural values (rationality, consistency).  viet-nam has
a very high literacy rate due to this, i've been told.

thi
From: Ingvar Mattsson
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <87663jcpdx.fsf@gruk.tech.ensign.ftech.net>
·········@becket.net (Thomas Bushnell, BSG) writes:

> Erik Naggum <····@naggum.net> writes:
[SNIP]
> >   Consider the phonemes of the word "really".  The toneme is the difference
> >   in pronunciation between "Really?" and "Really." and "Really!".
> 
> Yeah, but there it's a matter of marking, which is different than
> tone.  A better example in English is between homographs like
> "conduct" (a noun, stress on the first syllable) and "conduct" (a
> verb, stress on the second syllable).  

If I remember my youth (and Norwegian workmate from then) correctly,
Norwegian is similar to Swedish in this regard and there is no
difference in stress pattern between quite a few similar-sounding-but-
obviously-different words.

//Ingvar
-- 
(defun m (a b) (cond ((or a b) (cons (car a) (m b (cdr a)))) (t ())))
From: Torsten
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7qt9n$1ddr$2@news.cybercity.dk>
Thomas Bushnell, BSG <·········@becket.net> skrev:

> Oh, ok.  That's a good point; the term "phoneme" is ambiguous I think.
> Tonal differences are sometimes phonemic and sometimes not, but I now
> understand what you mean.  Whether a tonal or length difference should
> be officially phonemic is a matter of style and not any real linguistics,
> as far as I can tell.

Tone is phonemic in Norwegian and Swedish. Tonal differences can
be used to form minimal pairs in those languages, as Erik has
already shown earlier.

-- 
Torsten
From: Alain Picard
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <861ye7ag0d.fsf@gondolin.local.net>
Erik Naggum <····@naggum.net> writes:

> 
>   French, for instance, has no stress, but tends to use marginally shorter
>   and longer vowels.  They also have no tonemes, so the French have very
>   _serious_ problems dealing with other languages and sound ridiculous in
>   almost every other language than their own.

What makes you think they don't sound equally ridiculous in French?  ;-)

In high school, I never did understand what the English teacher was
going on about, with his "iambic pentameter" stuff.  If you come from
a monotonic language, the whole thing doesn't make a lot of sense.
Oh well, _our_ rhymes are a lot more exact.

*Years* later, having married an anglophone and lived in english
society for a few years, it was finally explained to me that english
has this "stress" thing...  my accent improved markedly after that.

-- 
It would be difficult to construe        Larry Wall, in  article
this as a feature.			 <·····················@netlabs.com>
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m38z8gdmtg.fsf@hanabi.research.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

>   Some languages also have tonemes, not just phonemes.  Norwegian is among
>   them.  The phonemes of the Norwegian words for "farmers", "prayers" and
>   "beans" are the same, but the tonemes differ.  Immigrants often have
>   farmers for dinner and purchase produce directly from beans as a result.
>   The word for "farmers" is spelled "bønder" but "beans" and "prayers" are
>   both spelled "bønner".  Note that this is not a question of stress.  All
>   three stress the first syllable exactly the same, and do not stress the
>   final syllable.

So what?  What does this have to do with anything?  I have already
pointed out examples (albeit not from Norwegian, which I don't know at
all) for this phenomenon.  Pronunciation and spelling are often at
odds.  Therefore, one cannot argue on the basis of phonetics which
visual distinctions in the written language matter and which ones
don't.  As far as I am concerned, uppercase and lowercase are not the
same.  In German, this is simply a fact of how the written language is
defined.  Getting the capitalization wrong is a spelling error just
like using the wrong vowel, missing an 'h' somewhere, using 'ss' where
'ß' should be used, joining words where they ought to be separated and
vice versa, and so on and so forth.  Of course, many of these
distinctions are redundant to some degree.  Case distinctions are not
the only redundancies.  Should we abolish all whitespace just because
with some practice one can infer where word boundaries are?  I haven't
seen anyone suggesting this.  (And again, there are precedents for
such a thing, for example in some far eastern languages where words
are not visibly separated in the written language.)

>   OF course, you are a Scheme freak and a tourist in comp.lang.lisp, the
>   very canonicalization of the irresponsible trouble-maker who thinks he is
>   an outsider to the community he torments with "you are silly who do it
>   differently from me" attitudes.  Thank you for contributing to the
>   _impression_ that Scheme is the language of choice of deranged lunatics.

Quite funny that you think I am a Scheme person...
(Especially considering that Scheme, like CL, uses case-insensitive identifiers.)

Matthias
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226105259872065@naggum.net>
* Matthias Blume
| So what?  What does this have to do with anything?

  Why are you still talking?  This is "supremely silly" and you keep
  blabbering?  What for?

| As far as I am concerned, uppercase and lowercase are not the same.

  Nobody has said they are.  Please just grasp this, OK?  That some
  distinction is incidental does not mean that it is not there.  I wonder what
  your limited brainpower has concluded that this discussion is all about
  when you are so devoid of understanding.  Geez, you are _so_ stupid.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3vgbklyus.fsf@hanabi.research.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

> | As far as I am concerned, uppercase and lowercase are not the same.
> 
>   Nobody has said they are.  Please just grasp this, OK?  That some
>   distinction is incidental does not mean that it is not there.

I meant: they are intrinsically not the same.
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226112534714801@naggum.net>
* Matthias Blume
| I meant: they are intrinsically not the same.

  Then your position is not only misguided, but utterly false, you
  supremely silly man.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3n0wvedhp.fsf@hanabi.research.bell-labs.com>
Erik Naggum <····@naggum.net> writes:

> * Matthias Blume
> | I meant: they are intrinsically not the same.
> 
>   Then your position is not only misguided, but utterly false, [ ... ]

I do understand what you are trying to get at with your distinction
between "intrinsic" and "incidental".  However, I think that this very
distinction itself is in the "incidental" category.  It is absolutely
not clear where to draw the line between "intrinsic" features of
spelling and "incidental" ones.  Why single out the sentence-initial
capitalization rule?  Why not also get rid of repeated consonants?  At
least in some languages, the rules about those are just as
"incidental" as the capitalization rules.

What you are trying to do is separate content and form.  I wish you
good luck in this endeavor, but also predict that it is doomed from
the beginning.  If you could actually do it, that would be great: We
could store the intrinsic parts of German text and then "render" it
according to the spelling and grammar rules of the day.  (There
recently has been a big official -- and very controversial -- reform
of the spelling rules in German.  They attack precisely some of those
"incidental" aspects, but strangly, leave others (such as
sentence-initial capitalization) untouched.)

By the way, I assume that the abuse that you heap on people when you
get into one of your famous tirades is "incidental"...

Matthias
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226138169931756@naggum.net>
* Matthias Blume <········@shimizu-blume.com>
| I do understand what you are trying to get at with your distinction
| between "intrinsic" and "incidental".  However, I think that this very
| distinction itself is in the "incidental" category.

  Up-casing all letters in a heading is clearly incidental.  Capitalization
  of each non-preposition in a title is clearly incidental.  Capitalization
  of the sentence-initial word is clearly incidental.  Capitalization of
  proper names is perhaps intrinsic, in which case information should not
  be lost when you write "Smith said ..." and later change it to "After a
  brief pause, Smith said ...", should be recoverable from titles and
  headlines, and should therefore be regarded as information that
  incidental capitalization actually _destroys_.  If you think intrinsic
  capitalization is so important, you would have objected to the incidental
  capitalization or upcasing of words because of their information loss.
  You do not, so I conclude that you are completely _unconcerned_ with this
  loss of information from incidental capitalization, and therefore do
  _not_ regard intrinsic capitalization as important.

| It is absolutely not clear where to draw the line between "intrinsic"
| features of spelling and "incidental" ones.

  It appears that you think that intrinsic-p = (complement incidental-p).
  This is unwarranted, and most of your argumentation just falls to pieces
  because you believe this and argue against a negative.

| Why single out the sentence-initial capitalization rule?  Why not also
| get rid of repeated consonants?

  What the fuck are you talking about?  Geez, are you for real?

| At least in some languages, the rules about those are just as
| "incidental" as the capitalization rules.

  What _are_ you unable to deal with?

| What you are trying to do is separate content and form.  I wish you good
| luck in this endeavor, but also predict that it is doomed from the
| beginning.

  Look, you are so stupid that this is getting seriously boring: The whole
  context of the discussion is what if we could design things all over?
  Your insipid complaints and your moronic attitude problems are hostile.

| By the way, I assume that the abuse that you heap on people when you get
| into one of your famous tirades is "incidental"...

  In your case, stupidity and hostility seem to be intrinsic.  Just THINK,
  and you will find a nicer side of me.  Be an annoying asshole, and you
  find me unpleasant.  It really is that simple.  Some people _are_ no more
  than annoying assholes and think it is my fault.  This is not so, but it
  sure seem to make annoying assholes happier to think it is.  This is how
  they remain annoying assholes.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Andreas Eder
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3zo0wuzsm.fsf@elgin.eder.de>
Kent M Pitman <······@world.std.com> writes:

> Capitalization _is_ incidental.  It is ceremonially marked in written
> text, but my impression based on a basic knowledge of linguistics and
> a casual outside view of German [I don't purport to speak the
> language] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly, so
> there is generally enough contextual information to disambiguate in
> speech. 

Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a
long 'e' and the other with a short one - that is because they are
different words. Should you incidentally start a sentence with 'weg',
thus writing it with capital 'W' it would still be pronounced like
'weg'. This might be difficult to understand, but that is how natural
languages are, I guess.

Andreas
-- 
Wherever I lay my .emacs, there's my $HOME.
From: Dorai Sitaram
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7o7eu$i8a$1@news.gte.com>
In article <··············@elgin.eder.de>,
Andreas Eder  <············@t-online.de> wrote:
>Kent M Pitman <······@world.std.com> writes:
>
>> Capitalization _is_ incidental.  It is ceremonially marked in written
>> text, but my impression based on a basic knowledge of linguistics and
>> a casual outside view of German [I don't purport to speak the
>> language] is that German people may claim that "weg" and "Weg" are
>> different words, but the capitalization is not pronounced audibly, so
>> there is generally enough contextual information to disambiguate in
>> speech. 
>
>Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a
>long 'e' and the other with a short one - that is because they are
>different words. Should you incidentally start a sentence with 'weg',
>thus writing it with capital 'W' it would still be pronounced like
>'weg'. This might be difficult to understand, but that is how natural
>languages are, I guess.
>


To me, that case is indeed ornamental is supported by
the fact that it appears to be permissible to
upper-case a German sentence in its entirety
without construing it as a loss of information.

BITTE EIN BIT
ICH BIN EIN BERLINER
DIE MAUER MUSS WEG!

usw.

Ie, things like titles, slogans, and billboards, but
also consider the GPL or other license text in the
German, where large globs of the prose are in all caps.
Legal prose, it seems to me, would especially not court
information loss in this manner if it was felt there
really was a risk.

I'm curious: Is there an example, however
frivolous, where WEG in an all-caps sentence
could be ambiguous?

BTW, the {Weg, weg} pair seems very like the {produce
(noun), produce (verb)} pair in English.  Like Weg/weg,
produce/produce are pronounced differently.
However, they  don't rely on capitalization, even
though the grammatical context used to disambiguate
between them has fewer cues than the German.

--d 
From: Matthias Blume
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <foy9ggckwm.fsf@trex10.cs.bell-labs.com>
····@goldshoe.gte.com (Dorai Sitaram) writes:

> I'm curious: Is there an example, however
> frivolous, where WEG in an all-caps sentence
> could be ambiguous?

Yes, there is a joke about a stupid person who tries to figure out
which street he is in and comes up with

   "We are on the trail with the nukes."

because he misread the slogan

   "WEG MIT DEN ATOMWAFFEN"   (meaning "GET RID OF THE NUKES")

as a streetsign.

> BTW, the {Weg, weg} pair seems very like the {produce
> (noun), produce (verb)} pair in English.  Like Weg/weg,
> produce/produce are pronounced differently.

In this case, there is at best a very remote semantic relationship (if
any).  It is definitely nowhere near a noun/verb sort of thing.

Matthias
From: Kent M Pitman
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <sfweli8rz1w.fsf@shell01.TheWorld.com>
Matthias Blume <········@shimizu-blume.com> writes:

> ····@goldshoe.gte.com (Dorai Sitaram) writes:
> 
> > I'm curious: Is there an example, however
> > frivolous, where WEG in an all-caps sentence
> > could be ambiguous?
> 
> Yes, there is a joke about a stupid person who tries to figure out
> which street he is in and comes up with
> 
>    "We are on the trail with the nukes."
> 
> because he misread the slogan
> 
>    "WEG MIT DEN ATOMWAFFEN"   (meaning "GET RID OF THE NUKES")
> 
> as a streetsign.

Yes, but this kind of confusion can happen whether case is involved or not,
and I think it's not fair to ascribe it to case as the principal cause.
We have signs on our highways that say "FINE FOR LITTERING".  Writing
them in lowercase won't help. ;-)

> > BTW, the {Weg, weg} pair seems very like the {produce
> > (noun), produce (verb)} pair in English.  Like Weg/weg,
> > produce/produce are pronounced differently.
> 
> In this case, there is at best a very remote semantic relationship (if
> any).  It is definitely nowhere near a noun/verb sort of thing.

There is a phenomenon in English speech wherein stress matters, too,
and we sometimes italicize not just to control emphasis but to
actively disambiguate.  A prime example of this is an effect called
anaphoric de-stressing (that is, lessening stress in order to turn a
reference into an anaphoric reference--that is, a reference to a previously
introduced entity--instead of a non-anaphoric reference--that is, a reference to a
newly introduced entity).  The example I've seen is a story of a newsreader
misreading an account of how a man, upon hearing his wife had had an affair
with another man, had said he wanted to shoot the bastard.  (Note how the
sentence changes meaning, depending on whether you put stress on _shoot_
or on _bastard_.)  Written English doesn't mark this distinction in writing,
even though it's present and by some stretch important in spoken English.  
People figure it out.
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7pg21$n2u71$2@ID-125440.news.dfncis.de>
In article <············@news.gte.com>, Dorai Sitaram wrote:
> In article <··············@elgin.eder.de>,
> Andreas Eder  <············@t-online.de> wrote:

>>Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a
>>long 'e' and the other with a short one - that is because they are
>>different words. Should you incidentally start a sentence with 'weg',
>>thus writing it with capital 'W' it would still be pronounced like
>>'weg'. This might be difficult to understand, but that is how natural
>>languages are, I guess.
> 
> To me, that case is indeed ornamental is supported by
> the fact that it appears to be permissible to
> upper-case a German sentence in its entirety
> without construing it as a loss of information.

This is true for /some/ sentences, but not /all/ sentences (I posted
an example).  In the seventies, it was popular among radical leftists
to write everything in lowercase.  The slogan was: ``Wer groszschreibt
ist auch fuer's Groszkapital'', freely translated something like
``Friends of capitalization are also friends of the capital'', or
some such.  It is /significantly/ harder to read a German text
without proper capitalization.

Regards,
-- 
Nils Goesche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x42B32FC9
From: Holger Schauer
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <whvgbikpe8.fsf@allblues.coling.uni-freiburg.de>
On 25 Mar 2002, Dorai Sitaram wrote:
> Andreas Eder  <············@t-online.de> wrote:
>>Kent M Pitman <······@world.std.com> writes:
>>
>>> Capitalization _is_ incidental.  It is ceremonially marked in
>>> written text, but my impression based on a basic knowledge of
>>> linguistics and a casual outside view of German [I don't purport
>>> to speak the language] is that German people may claim that "weg"
>>> and "Weg" are different words, but the capitalization is not
>>> pronounced audibly, so there is generally enough contextual
>>> information to disambiguate in speech.

I think you are confusing two separate issues: written and spoken
language. If we assume that language very likely has not started from
words, and that humans seem to be somehow equipped with some kind of
language "module", it seems natural that how something is spoken and
written is quite often different. As an example, consider the normal
alphabet we're using here, which certainly does not even reflect how
something is pronounced; which is why linguists came up with the
phonetic alphabet, after all. On the contrary, (grasping) written
language may also have some aspects that are unique to the fact that
it is /written/. Whitespace comes to mind, which was introduced to help
identify word and sentence boundaries. That such things do matter
should be obvious to anyone who has ever read a large and poorly
typeset document. I see the matter of capitalization on the same level:
it may help you when you do have to distinguish ambiguous cases. 

>>Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with
>>a long 'e' and the other with a short one - that is because they are
>>different words.

But Kent is surely right in saying that indeed the capitalization is
usually not pronounced.

> To me, that case is indeed ornamental is supported by
> the fact that it appears to be permissible to
> upper-case a German sentence in its entirety
> without construing it as a loss of information.
> 
> BITTE EIN BIT
> ICH BIN EIN BERLINER
> DIE MAUER MUSS WEG!

Capitalization is a tool for disambiguating ambiguous words. One tool.
If you're using an all caps font, you can't give that information. So
you'll probably try to avoid ambiguous cases, which are rare, btw.
Others have already posted examples in which some confusion might be
avoided by looking at case. As already has been posted by somebody
else, the spelling reform, while fixing some irritating rules,
resulted in the introduction of many more ambiguous cases (most of
them not case related, though).

However, why German capitalization rules are how they are, is beyond
me (I can live with them, for sure). As (simplifying) only nouns are
capitalized, it seems like capitalization should help in getting quickly
to the right grammatical categorization, helping the syntax process.
But there are only rare cases in which the fact that something is a
noun and not, say, a verb is really problematic; there is typically
enough contextually provided grammatical and semantic information.
Actually, I think English would be much more in need of noun
capitalization with its often lax handling of embedded sentences ("The
horse raced past the barn fell" gives me headaches). A lot more
problematic issues arise from homophones (e.g. the bank you sit on
vs. the bank you give your money to), which are not at all addressed by
noun capitalization.

Holger

-- 
---          http://www.coling.uni-freiburg.de/~schauer/            ---
"In Scheme, as in C, every programmer has to be a genius, but often comes
 out a fool because he is so far from competent at every task required."
                  -- Erik Naggum in comp.lang.lisp
From: Jochen Schmidt
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7qcif$jhc$1@rznews2.rrze.uni-erlangen.de>
Kent M Pitman wrote:


> Capitalization _is_ incidental.  It is ceremonially marked in written
> text, but my impression based on a basic knowledge of linguistics and
> a casual outside view of German [I don't purport to speak the
> language] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly, so
> there is generally enough contextual information to disambiguate in
> speech.

In German nouns are written uppercase. "Weg" is a noun which means "way" in 
English. On the other side the verb "weg" means "away". They are pronounced
differently ("Weg", a long "e" and "weg" a short one) and they are of 
course semantically different symbols. Capitalizing nouns in German is a 
redundant thing - there is no bigger problem than in English. English,
too, has examples of colliding nouns and verbs (for example "saw"), but
they are resolved by their grammatical or semantic role in the
sentence. If a verb is at the beginning of a sentence it is capitalized too
in German (like in "Weg hier!" <-> "Away from here!").
Capitalizing nouns in written text makes the process of disambiguation
easier but raises the burden for the writer. Many people in Germany who do
not know German grammar well enough have problems with proper
capitalization when writing. Even if they know it well enough - many 
chatters in IRC write completely without capitalization because they can 
type faster this way.

ciao,
Jochen

--
http://www.dataheaven.de
From: Nils Goesche
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7qdt2$mtcfg$1@ID-125440.news.dfncis.de>
In article <············@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt wrote:
> 
> In German nouns are written uppercase. "Weg" is a noun which means "way" in 
> English. On the other side the verb "weg" means "away". They are pronounced
> differently ("Weg", a long "e" and "weg" a short one) and they are of 
> course semantically different symbols. Capitalizing nouns in German is a 
> redundant thing - there is no bigger problem than in English.

There is lots of redundancy in both spelling and human language.  If
you omitted every third vowel, I'd still understand your writings.
And there are many people who say that it /is/ a bigger problem than
in English (in English, there is /no/ problem).  Possibly because of
greater freedom in the order of words in German, I don't know.

> Even if they know it well enough - many 
> chatters in IRC write completely without capitalization because they can 
> type faster this way.

They also write `n8' instead of `Good night!'.  I hope you are not
proposing to eliminate all redundancy in our language :-)

Regards,
-- 
Nils Goesche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x42B32FC9
From: Jochen Schmidt
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <a7qhkc$m2f$1@rznews2.rrze.uni-erlangen.de>
Nils Goesche wrote:

> In article <············@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
> wrote:
>> 
>> In German nouns are written uppercase. "Weg" is a noun which means "way"
>> in English. On the other side the verb "weg" means "away". They are
>> pronounced differently ("Weg", a long "e" and "weg" a short one) and they
>> are of course semantically different symbols. Capitalizing nouns in
>> German is a redundant thing - there is no bigger problem than in English.
> 
> There is lots of redundancy in both spelling and human language.  If
> you omitted every third vowel, I'd still understand your writings.

Of course. Redundancy can be a good thing to raise transmission safety and 
to speed up recognition. If you omitted every third vowel you would force 
people to search for words that fit the pattern and raise the burden for 
the reader.

> And there are many people who say that it /is/ a bigger problem than
> in English (in English, there is /no/ problem).  Possibly because of
> greater freedom in the order of words in German, I don't know.

This is probably true - my claim was merely a subjective one. As others 
pointed out it is no _big_ problem to understand German texts written in 
either all uppercase or all lowercase. It is certainly true that for most 
Germans it is easier to read proper capitalized texts. AFAICT there are 
only some cases in which missing capitalization would make German 
incomprehensible.

>> Even if they know it well enough - many
>> chatters in IRC write completely without capitalization because they can
>> type faster this way.
> 
> They also write `n8' instead of `Good night!'.  I hope you are not
> proposing to eliminate all redundancy in our language :-)

No - not really ;-)
I hope it did not sound like that. Redundancy is not bad - it is inherently 
important for communication. Removing all redundancy would have 
catastrophic effects.

ciao,
Jochen

--
http://www.dataheaven.de
From: Michael Parker
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <9f023346.0203251213.4a36ba04@posting.google.com>
Erik Naggum <····@naggum.net> wrote in message news:<················@naggum.net>...
>   ... but at least designing languages
>   and coding conventions to use case would not likely happen if case was
>   regarded as just as incidental as color or typeface.

OTOH, if terminals had gotten color and typefaces earlier, maybe
programming languages would have evolved to use them.  Maybe give
each namespace its own color, so you would specify the value of a
name by putting it in blue, the function by using red, keywords in
italics, macros in green.  The mind boggles at the possibilities.
In fact, if you want to boggle your mind, see 

http://www.sleepless-night.com/cgi-bin/twiki/view/Main/ColorForth

Which describes Chuck Moore's latest dialect of forth that does
this sort of thing.
From: Erik Naggum
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <3226096405923415@naggum.net>
* Michael Parker
| OTOH, if terminals had gotten color and typefaces earlier, maybe
| programming languages would have evolved to use them.

  Only if we had also had a stateless coding for them, statefulness being
  so frightening to the kinds of programmers who are likely to invent new
  syntaxes.

| Maybe give each namespace its own color, so you would specify the value
| of a name by putting it in blue, the function by using red, keywords in
| italics, macros in green.  The mind boggles at the possibilities.

  Especially if they also used XML to write it all, and then we can use
  cascading style sheets to control both background and foreground color.
  And programmers would have to be selected from those who are not color
  blind.  This is unlikely to succeed, since the current selection from
  those who can spell has not been successful, either, and that is at least
  something you can learn.

  Thanks for the URL, though.  My mind boggles at statements like these:
  "With the huge RAM of modern computers, an operating system is no longer
  necessary."

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: Christopher Browne
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <m3u1r43sfg.fsf@chvatal.cbbrowne.com>
In the last exciting episode, Erik Naggum <····@naggum.net> wrote::
> * Michael Parker
> | OTOH, if terminals had gotten color and typefaces earlier, maybe
> | programming languages would have evolved to use them.
>
>   Only if we had also had a stateless coding for them, statefulness being
>   so frightening to the kinds of programmers who are likely to invent new
>   syntaxes.
>
> | Maybe give each namespace its own color, so you would specify the value
> | of a name by putting it in blue, the function by using red, keywords in
> | italics, macros in green.  The mind boggles at the possibilities.
>
>   Especially if they also used XML to write it all, and then we can use
>   cascading style sheets to control both background and foreground color.
>   And programmers would have to be selected from those who are not color
>   blind.  This is unlikely to succeed, since the current selection from
>   those who can spell has not been successful, either, and that is at least
>   something you can learn.
>
>   Thanks for the URL, though.  My mind boggles at statements like these:
>   "With the huge RAM of modern computers, an operating system is no longer
>   necessary."

Yes, that seems rather a strange comment.

Note that one of Moore's more-publicized quasi-recent projects
involved building a CAD system for designing microprocessors.

His approach was basically to write the application-cum-operating
system on top of a tiny kernel of Forth instructions, which meant he
started with 80486 assembler and built up from there.

Apparently it offered vast opportunities to avoid all kinds of cruft
that tends to get built into CAD systems, but what it really amounted
to was that he built his system as an embedded system on top of bare
Intel metal.

I think a lot of his argument is that people keep building cruft on
top of cruft, when they might be better off with a _good_ embedded
system.  

Consider the horrors of MS Office: We might be better off if, instead
of continually being mandated by the latest bloatware upgrade to
upgrade their system to the latest "Pentium IV with more memory than
anyone could _conceive_ of ten years ago," people bought cheap
electronic typewriters with bare bits of computing power.  

If people spent their time _typing_, instead of trying to figure out
which menu allows them to change some bit of formatting, they might
get more work done.  Consider that back in the old days, Unix used to
run in 128K words of memory, and CP/M machines could handle word
processing, spreadsheets, and databases in 56K of RAM.  The notion
that you need 256MB of RAM to realistically run Windows XP should be
offensive.

In any case, Moore is a fascinating character. He is perhaps not
always to be taken seriously, but he's had more inspired ideas than
most people ever learn about...
-- 
(reverse (concatenate 'string ········@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/wp.html
"Cars  move  huge  weights  at  high  speeds  by  controlling  violent
explosions many times a  second. ...car analogies are always fatal..."
-- <········@my-dejanews.com>
From: Seth Gordon
Subject: Re: case-sensitivity and identifiers (was Re: Wide character  implementation)
Date: 
Message-ID: <3CA09555.BEC33FC4@genome.wi.mit.edu>
Christopher Browne wrote:
> 
> Consider the horrors of MS Office: We might be better off if, instead
> of continually being mandated by the latest bloatware upgrade to
> upgrade their system to the latest "Pentium IV with more memory than
> anyone could _conceive_ of ten years ago," people bought cheap
> electronic typewriters with bare bits of computing power.
>
> If people spent their time _typing_, instead of trying to figure out
> which menu allows them to change some bit of formatting, they might
> get more work done.

You assume that people (or corporate purchasing agents) spent money on
more expensive word processors only because they wanted to get more work
done.  I am not sure this is true.  :-)

-- 
"Any fool can write code that a computer can understand.
 Good programmers write code that humans can understand."
 --Martin Fowler
// seth gordon // wi/mit ctr for genome research //
// ····@genome.wi.mit.edu // standard disclaimer //
From: Duane Rettig
Subject: Re: case-sensitivity and identifiers (was Re: Wide character implementation)
Date: 
Message-ID: <4eli9j8nt.fsf@beta.franz.com>
Ed L Cashin <·······@uga.edu> writes:

> A general principle of mine is that if things are distinguishable,
> they should not be collapsed but the distinction should be preserved
> whenever possible.  Treating different characters as the same
> character, or treating different character sequences as equivalent,
> should be postponed as long as possible in order to preserve
> information.

This is your opinion, and many people agree with you, but many do not,
as well.  This is a very controversial subject.   And it's not just in
comp.lang.lisp that you'll find this same controversy; at about the same
time as our last discussion here there was a similar one raging on
comp.arch.  The difference was that here the case-insensitive style being
advocated was (of course) the case-folding style that the Common Lisp
reader standardizes, and in comp.arch the predominant case-insensitive
style being argued was the "case-preserving" style, which is the kind
of recognition style that both Mac and Windows filesystems support
(i.e. first reference gets internalized as originally specified, but
subsequent references are matched against the filename without regard
to case).  This case-preserving insensitive style was being pitted
against the Unix case-sensitive style.  Of course, neither side
changed the other's mind.

Arguing case-sensitivity is very similar to arguing endianness; there
are good arguments for both big-endian and little-endian, and neither
side is fully right or fully wrong, though a decision must usually be
made, because it is generally hard to mix the two together in the same
machine.

> Are you suggesting that this principle is inappropriate to apply to
> the character sequences that compose identifiers in source code?  That
> would mean that "ABLE" is the same identifier as "able".  I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium.  (I know that Lisp
> predates BASIC.  I'm talking about my reaction.)  I'd be happy to hear
> a good case for case-insensitive identifiers.

First, I'll note (as others have) that Common Lisp does have
case-sensitive identifiers, and always has.  It is the reader that
is specified to fold to uppercase by default.  And even the
standard CL reader is highly configurable, to allow cases to be
specified by readtable options.
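
Concretely, the standard readtable option that controls this is
READTABLE-CASE; a minimal sketch of its effect, using a private copy of
the standard readtable so nothing global is disturbed (:DOWNCASE and
:INVERT are the other possible settings):

(let ((*readtable* (copy-readtable nil)))        ; fresh copy of the standard readtable
  (setf (readtable-case *readtable*) :preserve)  ; instead of the default :UPCASE
  (read-from-string "foo"))                      ; => |foo| -- lowercase symbol, case preserved

(read-from-string "foo")                         ; => FOO under the default :UPCASE readtable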

Second, the choice of case-sensitivity or not is not bounded by
time.  Going back to the endianness question, some engineers 10
years ago said "the little-endian side has lost".  However, I
suspect that if you count all of the little-endian machines in
existence today, you find it hard to justify that claim.  In
fact, even many computers which are generally considered to be
big-endian are now architected to allow for either endianness.

Finally, I personally believe in choice.  Our own product has
always allowed one to choose between the Common Lisp-specified
case-insensitive reader and a reader configured to be case-sensitive
by default.  Our customer base has always taken
advantage of that choice, with anywhere from approximately 20% to 35%
choosing the case-sensitive mode, and the majority choosing the Common
Lisp (case-insensitive, folding to uppercase) mode.  And of course,
this does not account for people who use lisps of both modes for
different purposes.  Nowadays, there is a slight increase in
case-sensitive mode for the purpose of interfacing relatively directly
with some currently popular case-sensitive languages.  The point,
though, is that we have always provided a choice, and always intend
to provide a choice.

In fact, Kent Pitman recently sent us a proposal for unifying
the two major case-modes that Allegro CL provides, in such a
way that the two can exist in the same lisp simultaneously.
We have an rfe (request for enhancement document) which starts
with his proposal as a basis.  I would love to see us succeed
in making this or any similar unification, and I was excited to
see Kent's proposal when he sent it to us.

It's all about choice.  Calling the case-insensitive choice a
"throwback" is the same as calling it invalid (or no longer
valid).  And based on my own experience here and in comp.arch,
that is simply incorrect.  People still choose both styles,
and probably always will.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Kenny Tilton
Subject: Re: case-sensitivity and identifiers (was Re: Wide character  implementation)
Date: 
Message-ID: <3CA14AED.4E9924BE@nyc.rr.com>
Ed L Cashin wrote:
> 
>  I'd be happy to hear
> a good case for case-insensitive identifiers.

I've done a ton of case-sensitive C and I've done a ton of code in
case-insensitive languages. I like case-insensitive much, much more.
Does that count?

A deeper reason is that it seems weird to use case to differentiate two
things. If I looked down and saw an app with two functions, say, ABLE-P
and able-p, meaning different things which the case was meant to convey,
I would have regrettably ungenerous thoughts regarding the author.

-- 

 kenny tilton
 clinisys, inc
 ---------------------------------------------------------------
"Harvey has overcome not only time and space but any objections."
                                                        Elwood P. Dowd
From: Ed L Cashin
Subject: Re: case-sensitivity and identifiers (was Re: Wide character  implementation)
Date: 
Message-ID: <87vgbhcxnt.fsf@cs.uga.edu>
Kenny Tilton <·······@nyc.rr.com> writes:

> Ed L Cashin wrote:
> > 
> >  I'd be happy to hear
> > a good case for case-insensitive identifiers.
> 
> I've done a ton of case-sensitive C and I've done a ton of code in
> case-insensitive languages. I like case-insensitive much, much more.
> Does that count?

Does for me.  Also, I think Erik Naggum provided a good argument
undermining my original assumption that 'a' and 'A' are different
characters in reality (even though they have distinct encodings in
ASCII, as opposed to his ideal), and Kent Pitman pointed out that
psychologically, we think of strings of letters first, remembering
case less frequently.

> A deeper reason is that it seems weird to use case to differentiate
> two things. If I looked down and saw an app with two functions, say,
> ABLE-P and able-p, meaning different things which the case was meant
> to convey, I would have regretably ungenerous thoughts regarding the
> author.

Yes.  The responses I've read beg the question, "Isn't it the code
author's fault if case-sensitivity is abused?"  I mean, if the
language is case sensitive and people write poor-quality code, is that
the language designer's fault?

As for case-folding issues in the face of different languages with
different notions of upper and lower case, it seems like many hairy
issues are associated with it, and I'm looking forward to the day when
I can appreciate them all!

-- 
--Ed L Cashin            |   PGP public key:
  ·······@uga.edu        |   http://noserose.net/e/pgp/
From: Janis Dzerins
Subject: Re: case-sensitivity and identifiers (was Re: Wide character  implementation)
Date: 
Message-ID: <87663hxgsd.fsf@asaka.latnet.lv>
Ed L Cashin <·······@uga.edu> writes:

> Yes.  The responses I've read beg the question, "Isn't it the code
> author's fault if case-sensitivity is abused?"  I mean, if the
> language is case sensitive and people write poor-quality code, is that
> the language designer's fault?

Yes, it is the language designer's fault.  At least to some degree.

Languages make some errors easier to make and others harder, some
concepts easier to express and some harder.  And that is within the
language designer's power to shape.

> As for case-folding issues in the face of different languages with
> different notions of upper and lower case, it seems like many hairy
> issues are associated with it, and I'm looking forward to the day when
> I can appreciate them all!

Just curious -- which languages have different notions of upper and
lower case characters (if you are talking about programming languages,
of course)?

-- 
Janis Dzerins

  Eat shit -- billions of flies can't be wrong.
From: Kenny Tilton
Subject: Re: case-sensitivity and identifiers (was Re: Wide character   implementation)
Date: 
Message-ID: <3CA34B56.17EE8065@nyc.rr.com>
Ed L Cashin wrote:
> 
> Kenny Tilton <·······@nyc.rr.com> writes:
> 
> 
> >  If I looked down and saw an app with two functions, say,
> > ABLE-P and able-p, meaning different things which the case was meant
> > to convey,...
> 
> Yes.  The responses I've read beg the question, "Isn't it the code
> author's fault if case-sensitivity is abused?"  

Oh, I jumped in on the middle of this (just can't keep up with c.l.l.
anymore!) and maybe I missed something. Are you saying that ABLE-P vs
able-p is poor quality code, but SomeThingElse vs somethingelse is not?
If so, what would SomeThingElse be? If not...never mind. :)

-- 

 kenny tilton
 clinisys, inc
 ---------------------------------------------------------------
"Harvey has overcome not only time and space but any objections."
                                                        Elwood P. Dowd
From: IPmonger
Subject: Re: case-sensitivity and identifiers (was Re: Wide character   implementation)
Date: 
Message-ID: <m3u1r0394b.fsf@validus.delamancha.org>
Kenny Tilton <·······@nyc.rr.com> writes:

> Ed L Cashin wrote:
>> 
>> Kenny Tilton <·······@nyc.rr.com> writes:
>> 
>> 
>> >  If I looked down and saw an app with two functions, say,
>> > ABLE-P and able-p, meaning different things which the case was meant
>> > to convey,...
>> 
>> Yes.  The responses I've read beg the question, "Isn't it the code
>> author's fault if case-sensitivity is abused?"  
>
> Oh, I jumped in on the middle of this (just can't keep up with
> c.l.l. anymore!) and maybe I missed something. Are you saying that ABLE-P vs
> able-p is poor quality code, but SomeThingElse vs somethingelse is not? 
> If so, what would SomeThingElse be? If not...never mind. :)

    I believe what he's saying is that whichever of those are poor quality
  code - I would suggest that they all are - it isn't the fault of the
  language designer(s) but of the programmer.

  Which brings up the question:

  Is it at all possible to use case-sensitivity in an appropriate manner?


-jon
-- 
------------------
IPmonger
········@delamancha.org
From: Doug Quale
Subject: Re: case-sensitivity and identifiers (was Re: Wide character   implementation)
Date: 
Message-ID: <87663gmq9j.fsf@charter.net>
IPmonger <········@delamancha.org> writes:

>   Which brings up the question:
> 
>   Is it at all possible to use case-sensitivity in an appropriate manner?

Sure, at least in non-lispy languages.  A lot of C code uses the
convention that identifiers in all caps are macros.  Prolog (Edinburgh
syntax) requires leading capitalization to distinguish variables from
constant symbols (eliminating the need to quote constants in Prolog).
In Haskell, capitalized symbols indicate data types and constructors.
This makes the Haskell pattern matching syntax nicer.  Some languages
guarantee that all built-in identifiers will be of one case so that
the user can use identifiers with the other case without fear of
colliding with a current or future language-defined id.

As was noted a long time ago in this thread, in all these cases it's
harder to read the code aloud since capitalization doesn't change the
pronunciation.  In practice, I think users of these languages have
found that case distinctions work well all the same.


-- 
Doug Quale
From: ozan s. yigit
Subject: Re: Wide character implementation
Date: 
Message-ID: <4da3d9af.0203281000.7a8141cb@posting.google.com>
Erik Naggum:
> 	... It does an excellent job of explaining the
>   distinction between glyph and character.  I think you need it much more
>   than trying to defend yourself by insulting me with your ignorance.

imagine how much time you would have saved yourself and everyone else
had you just posted a useful part of the actual unicode standard, for
example p. 13, "Characters, Not Glyphs" [1]

	The Unicode standard draws a distinction between /characters/, which
	are the smallest components of written language that have semantic
	value, and /glyphs/, which represent the shapes that characters can
	have when they are rendered or displayed. Various relationships may
	exist between character and glyph; a single glyph may correspond to
	a single character, or to a number of characters, or multiple glyphs
	may result from a single character.

	[etc]

but it is more fun to lecture, and madly scribble on the board, isn't it? :-]

oz
---
[1] The Unicode Standard Version 3.0, Addison-Wesley, 2000.
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3226329016410517@naggum.net>
* ··@cs.yorku.ca (ozan s. yigit)
| imagine how much time you would have saved yourself and everyone else
| had you just posted a useful part of the actual unicode standard, for
| example pp. 13, "Characters, Not Glyphs" [1]

  Imagine how much time people would have saved _everybody_ if they cared
  to study something before they thought they had the right to produce
  "opinions".  "When did ignorance become a point of view?"  Then imagine
  how much time it would take to find out what some ignorant fuck needs to
  hear in order to become unconfused.  It is not my task to educate people
  who voice opinions on what they do not have the intellectual honesty and
  wherewithal to realize that they do not know sufficiently well.  People
  who cannot keep track of what they know and what they do not know, should
  shut the fuck up, but they never will, precisely because they are unaware
  of what they know and do not know.  Wade Humeniuk gave us a good analogy
  to his yoga classes and the mat-abusers.  Non-thinking cretins who post
  ignorant opinions to newsgroups are just the same kind of inconsiderate
  bastards.  But you choose to _defend_ them.  What does that make you?
  Those who have the intellectual honesty to separate what they know from
  what they just assume, also know where they heard something and can rate
  its probability and credibility.  Those are worth helping, because they
  are likely to learn from it.  Those who are unlikely to learn from what
  you tell them, are a waste of time.

| but it is more fun to lecture, and madly scribble on the board, isn't it? :-]

  Your life experiences apparently differ quite significantly from mine,
  but if you feel happy about exposing yourself like this, please do.  More
  idiotic drivel that lets the world know how you think is probably going
  to be the result of your obvious desire to inflame rather than inform, so
  go ahead, make a spectacle of yourself.  This newsgroup is quite used to
  your kind by now.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: ozan s yigit
Subject: Re: Wide character implementation
Date: 
Message-ID: <vi4ofh8dw19.fsf@blue.cs.yorku.ca>
[erik's bombastic drivel elided]

heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]

oz
From: Erik Naggum
Subject: Re: Wide character implementation
Date: 
Message-ID: <3226361140515648@naggum.net>
* ozan s yigit <··@blue.cs.yorku.ca>
| [erik's bombastic drivel elided]
| 
| heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]

  Oh, great, another nutjob at large.

///
-- 
  In a fight against something, the fight has value, victory has none.
  In a fight for something, the fight is a loss, victory merely relief.
From: ozan s. yigit
Subject: Re: Wide character implementation
Date: 
Message-ID: <4da3d9af.0203282232.60455a4f@posting.google.com>
Erik Naggum:

> | heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]
> 
>   Oh, great, another nutjob at large.

read your previous post. it speaks volumes.

oz
---
dreams already are. -- mark v. shaney
From: Ray Dillinger
Subject: Re: Wide character implementation
Date: 
Message-ID: <3C990D44.78917567@sonic.net>
"Thomas Bushnell, BSG" wrote:
> 
> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.
> 
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters?  Or worse yet, 35-bit Unicode characters?
> 
> At the same time, most characters in the system will of course not be
> wide.  What are the sane implementation strategies for this?

I'd have a fixed-width internal representation -- probably 32 bits 
although that's overkilling it by about a byte and a half, probably 
identical to some mapping of the Unicode character set -- and then 
use I/O functions that were character-set aware and could translate 
to and from various character sets and representations.  

I wouldn't want to muck about internally with a format that had 
characters of various different widths: too much pain to implement, 
too many chances to introduce bugs, not enough space savings. 
Besides, when people read whole files as strings, do you really 
want to run through the whole string counting multi-byte characters 
and single-byte characters to find the value of an expression like 

(string-ref FOO charcount)  ;; lookups in a 32 million character string!

where charcount is large?  I don't.  Constant width means O(1) lookup 
time. 
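
To make the cost concrete, here is a rough sketch in Common Lisp rather
than Scheme (hypothetical helper names, not anyone's actual
implementation).  With a fixed-width vector of code points the lookup is
a single array reference; with a UTF-8-style byte vector you have to
count character-start bytes up to the index:

(defun fixed-ref (codepoints index)
  ;; CODEPOINTS is a vector of 32-bit code points: O(1) access.
  (aref codepoints index))

(defun utf8-char-offset (octets index)
  ;; OCTETS is a UTF-8 encoded byte vector.  Bytes of the form 10xxxxxx
  ;; are continuation bytes, so every other byte starts a character.
  ;; We skip INDEX character starts, O(INDEX) work, and return the byte
  ;; offset where that character begins (decoding it is a further step).
  (loop with seen = 0
        for pos from 0 below (length octets)
        unless (= (ldb (byte 2 6) (aref octets pos)) #b10)
          do (when (= seen index)
               (return pos))
             (incf seen)))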

If space is limited, or if you're doing very serious performance 
tuning, you might want to have two separate constant-width internal 
character representations, one for short characters (ASCII or 16-bit) 
and one for long (full Unicode).  But if so, you're going to have to 
take into account the extra space that will be used by the 
additional executable code in your character and string comparisons 
and manipulation functions, and deal with the increased complexity 
there. That would introduce some mild insanity and chances for a few 
bugs, but imo it's not as bad as variable-width characters. 

What is sane, however, depends deeply on what environment you expect 
to be in.  You have to ask yourself whether the scheme you're writing 
will be used with data in multiple character sets.  

For example, will users want to read strings in EBCDIC and write 
them in Unicode?  How about the multiple incompatible versions of 
EBCDIC?  Do you have to support them, or can we let them die now? 
Will your implementation want to read and produce both UTF-8 and 
UTF-16 output?  Will you have to handle miscellaneous ISO character 
sets that have different characters mapped to the same character 
codes above 127?  Or obsolete ASCII where the character code we 
use as backslash used to mean 1/8?  How about five-bit Baudot 
coding?  :-)

Get character i/o functions that do translation, and then the 
lookups and references and compares and everything just work for 
free with simple code, and all you have to do to support a new 
character set is to provide a new mapping that the i/o functions 
can use.
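
For a single-byte character set such a mapping can literally be a
256-entry table.  A minimal sketch in Common Lisp (hypothetical names,
decoding direction only):

(defparameter *latin-1-map*
  ;; Latin-1 bytes are already Unicode code points, so the table is the
  ;; identity; an EBCDIC variant would just be a reordered table.
  (coerce (loop for i below 256 collect i)
          '(vector (unsigned-byte 32))))

(defun decode-string (octets table)
  ;; Translate external octets into the fixed-width internal form once,
  ;; at the I/O boundary; lookups and compares downstream stay simple.
  (map '(vector (unsigned-byte 32))
       (lambda (byte) (aref table byte))
       octets))

Multi-byte encodings such as UTF-8 need a stateful decoder rather than
a flat table, but the principle of translating only at the I/O boundary
is the same.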
From: Andy Heninger
Subject: Re: Wide character implementation
Date: 
Message-ID: <mrfm8.14653$A%3.104796@ord-read.news.verio.net>
"Ray Dillinger" <····@sonic.net> wrote
> Get character i/o functions that do translation, and then the
> lookups and references and compares and everything just work for
> free with simple code, and all you have to do to support a new
> character set is to provide a new mapping that the i/o functions
> can use.

If you want to provide full-up international support, the code for string
manipulation becomes anything but simple, no matter what your string
representation.  Think string compares that respect the cultural conventions
of different countries and languages (collation), for example.  And if
you're thinking Unicode, this is the direction you're headed.

See IBM's open source Unicode library for a good example of what's
involved -
http://oss.software.ibm.com/icu
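
As a tiny illustration of the gap (not ICU code, and it assumes a
Common Lisp whose characters are Unicode code points): naive comparison
orders by code point, which is not what any dictionary does.

(string< "zebra" "étude")  ; => 0 (true): #\z is U+007A, #\é is U+00E9,
                           ;    so "zebra" sorts first by code point

French dictionary order puts "étude" before "zebra"; getting that right
takes locale-aware collation data, which is exactly what libraries like
ICU carry.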



   -- Andy Heninger
      ········@us.ibm.com
From: Ray Dillinger
Subject: Re: Wide character implementation
Date: 
Message-ID: <3C9A0967.EC7BFF6B@sonic.net>
Andy Heninger wrote:
> 
> "Ray Dillinger" <····@sonic.net> wrote
> 
> If you want to provide full-up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation.  Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example.  And if
> you're thinking Unicode, this is the direction you're headed.

I dunno. As an implementor I want to make it *possible* to 
implement all the complications.  I want to take the major 
barriers out of the way and deal with encodings intelligently.  
I'm willing to leave presentation and non-default collation 
to the authors of language packages.  Let someone who knows 
and cares implement that as a library; I want to provide the 
foundation stones so that she can, and provide default 
semantics on anonymous characters (which, to me, includes 
anything outside of the Latin, European, extended Latin, 
and math planes) that are logical, consistent, and overridable.

Should the REPL rearrange itself to go top-char-to-bottom, 
right-column-to-left, with prompts appearing at the top, 
if someone has named their variables and defined their 
symbols with kanji characters instead of Latin? It's an 
interesting thought.  Should program code go in boustrophedon 
(lines alternating left-to-right and right-to-left, from the 
top down) if someone has named stuff using hieroglyphics? Um, 
maybe....  But is the Scheme system really where that kind of 
support is needed, or would it just confuse people? And what's 
the indentation convention for boustrophedon?

Maybe that last byte-and-a-half should be used for left-right 
and up-down and spacing properties, and the Scheme system itself 
ought to do all that stuff.  But it's not so important that I'm 
going to implement it before, say, read-write invariance on 
procedure objects.

			Bear
From: Duane Rettig
Subject: Re: Wide character implementation
Date: 
Message-ID: <4n0x1vnwp.fsf@beta.franz.com>
"Andy Heninger" <·····@jtcsv.com> writes:

> "Ray Dillinger" <····@sonic.net> wrote
> > Get character i/o functions that do translation, and then the
> > lookups and references and compares and everything just work for
> > free with simple code, and all you have to do to support a new
> > character set is to provide a new mapping that the i/o functions
> > can use.

Even before our current version of Allegro CL (6.1), we were
supporting external-formats to exactly that extent, and it has
been extendible (for the most part).  See

http://www.franz.com/support/documentation/6.0/doc/iacl.htm#locales-1
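
At the portable level, the hook for this is the standard
:EXTERNAL-FORMAT argument to OPEN and WITH-OPEN-FILE; only :DEFAULT is
standardized, so the designator below (and the file name) are
placeholder assumptions:

(with-open-file (in "data.txt" :external-format :utf-8)
  (read-line in))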

> If you want to provide full-up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation.  Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example.  And if
> you're thinking Unicode, this is the direction you're headed.
> 
> See IBM's open source Unicode library for a good example of what's
> involved -
> http://oss.software.ibm.com/icu

We incorporate a large amount of IBM's work (and other work, as well)
in our current localization support.  See

http://www.franz.com/support/documentation/6.1/doc/iacl.htm#localization-1

Note that we have chosen not to support LC_CTYPE and LC_MESSAGES at this time.
Also, LC_COLLATE is not supported for 6.1, but Unicode Collation Element
Tables (UCETs) will be supported for 6.2.

-- 
Duane Rettig          Franz Inc.            http://www.franz.com/ (www)
1995 University Ave Suite 275  Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253   ·····@Franz.COM (internet)
From: Brian Spilsbury
Subject: Re: Wide character implementation
Date: 
Message-ID: <f0f9d928.0203282307.700fb6cb@posting.google.com>
Ray Dillinger <····@sonic.net> wrote in message news:<·················@sonic.net>...
> 
> I wouldn't want to muck about internally with a format that had 
> characters of various different widths: too much pain to implement, 
> too many chances to introduce bugs, not enough space savings. 
> Besides, when people read whole files as strings, do you really 
> want to run through the whole string counting multi-byte characters 
> and single-byte characters to find the value of an expression like 
> 
> (string-ref FOO charcount)  ;; lookups in a 32 million character string!
> 
> where charcount is large?  I don't.  Constant width means O(1) lookup 
> time. 

Well, there are several mitigating factors and some issues with CL
which cause difficulties here.

If you consider your string as a sequence, then you can see that the
issues with variable width encodings produce a data-type which has the
access characteristics of a list.

The arguments for and against lists apply directly to variable-width
strings.

If we look at the use of strings, it falls into two fairly distinct
categories:

(a) Iteration:
    Printing, writing, reading, appending, scanning, copying, etc.
(b) Random-Access:
    Randomly accessing characters.

In fact, almost everything we do with strings is iterative (which makes
sense when you remember why strings are called strings).

The problem is that CL has rather poor support for iterating over
sequences.

If we considered a sequence to be addressed through two spaces, one
being Index-Space and the other Point-Space, we could avoid a lot of
these issues, and make lists more efficiently usable as sequences.

(elt seq index) would access the sequence through index space (which
might involve walking down a list N steps).
(elt-p seq point) would access the sequence through a point (which
would involve no traversal).

The trick to efficiently exploiting this then would be to get a point
from an index.

(dosequence (element point sequence)
  (when (char= element #\!)
    (setf (elt-p sequence point) #\$)))

for a fairly lame example.

With things like (subseq sequence :start-point a :end-point b), it
starts to become more flexible.

Or the ability to say (dosequence (element point sequence :start-point
point) ...) to allow the continuation of an iteration.

I'm not suggesting that this is an ideal solution, but it should at
least point out some inadequacies in the current model.
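
To make the idea concrete, here is a minimal list-only sketch.  These
are not standard CL operators, just an illustration of the protocol,
with a point represented by the cons cell itself:

(defun elt-p (sequence point)
  (declare (ignore sequence))
  ;; For a list, the point is the cell we already hold: no traversal.
  (car point))

(defun (setf elt-p) (new-value sequence point)
  (declare (ignore sequence))
  (setf (car point) new-value))

(defmacro dosequence ((element point sequence) &body body)
  ;; Walk the cells once, exposing each cell as POINT next to ELEMENT.
  (let ((cell (gensym "CELL")))
    `(loop for ,cell on ,sequence
           for ,element = (car ,cell)
           for ,point = ,cell
           do (progn ,@body))))

With those definitions, the #\! / #\$ example above updates elements in
place without re-walking the list from its head for every access.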

With appropriate primitives, the widespread use of list-like strings
should not even be considered problematic, imho.

And in answer to the example above, I don't think that anyone would
suggest forcing someone to use a variable-width string representation
at all times. If random access to a particular string is important to
you, then a vector-like string is obviously the way to go.

Regards,

Brian