character classes

From: Vladimir V. Zolotych
Subject: character classes
Date: Fri, 23 Feb 2001 19:29:56 +0000
Message-ID: <3A96BA34.53888059@eurocom.od.ua>

Hello

Suppose I have

(defun foo (c)
  (ecase c
    (#\_ "underscore")
    (#\- "hyphen")
    ((#\a #\s #\d #\f) "asdf")))

(#\a #\s #\d #\f) works well for a small number of choices.
What is the conventional (or most frequently used) way to handle
this when the choices can be generated automatically? E.g.
I'd like to return "asdf" for all English small letters.

Thanks in advance

-- 
Vladimir Zolotych                         ······@eurocom.od.ua


From: Jochen Schmidt
Subject: Re: character classes
Date: Fri, 23 Feb 2001 19:28:21 +0000
Message-ID: <976dbj$n3hp3$1@ID-22205.news.dfncis.de>

Vladimir V. Zolotych wrote:

> Hello
> 
> Suppose I have
> 
> (defun foo (c)
>   (ecase c
>     (#\_ "underscore")
>     (#\- "hyphen")
>     ((#\a #\s #\d #\f) "asdf")))
> 
> (#\a #\s #\d #\f) works well for a small number of choices.
> What is the conventional (or most frequently used) way to handle
> this when the choices can be generated automatically? E.g.
> I'd like to return "asdf" for all English small letters.

I do not know if I understand you correctly, but it is easy to write a
function (or macro) that generates a predicate function which does the
minimal number of comparisons to find out whether a character is a member
of some set of characters.
Remember that to check if a given character is a member of the English
(latin???) small letters you need only to test if it's between the
boundaries.
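
A minimal sketch of such a generator, for illustration only. The macro name
define-char-predicate and the two predicates defined with it are made up
here, and the range test assumes that the codes from #\a to #\z are
contiguous, which holds for ASCII-based character sets but is not something
the standard promises.

```lisp
;; Hypothetical helper: each SPEC is either a single character or a
;; two-element list (LOW HIGH) denoting an inclusive range.
(defmacro define-char-predicate (name &rest specs)
  "Define NAME as a one-argument predicate over characters."
  (let ((c (gensym "CHAR")))
    `(defun ,name (,c)
       (or ,@(loop for spec in specs
                   collect (if (consp spec)
                               ;; range: one three-argument comparison
                               `(char<= ,(first spec) ,c ,(second spec))
                               ;; single character
                               `(char= ,c ,spec)))))))

;; Assumes a..z are contiguous in the implementation's character codes.
(define-char-predicate english-small-letter-p (#\a #\z))
(define-char-predicate word-punctuation-p #\_ #\-)
```

With these definitions (english-small-letter-p #\s) is true and
(english-small-letter-p #\_) is false.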

Regards,
Jochen

From: Frank A. Adrian
Subject: Re: character classes
Date: Sat, 24 Feb 2001 15:36:06 +0000
Message-ID: <3zQl6.365$kZ1.129680@news.uswest.net>

"Jochen Schmidt" <···@dataheaven.de> wrote in message
···················@ID-22205.news.dfncis.de...
> Remember that to check if a given character is a member of the English
> (latin???) small letters you need only to test if it's between the
> boundaries.

Actually, this is only true in certain character code sets.  Having recently
been working on an IBM AS/400 (no, not in Lisp), this point has been well
and fully driven home to me.  It also fails for the ISO Western code set
(sorry, don't remember the ISO spec number) as per your "latin???" comment,
as letters with cedillas, accent grave, etc. are not properly interspersed.

The standard also places no restriction on mapping of codes to characters
other than the following ordering predicates on their codes:

 A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
 a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
 0<1<2<3<4<5<6<7<8<9
 either 9<A or Z<0
 either 9<a or z<0

Immediately following this in Section 13.1.6 of the Hyperspec comes the
explicit statement:

This implies that, for standard characters, alphabetic ordering holds within
each case (uppercase and lowercase), and that the numeric characters as a
group are not interleaved with alphabetic characters. However, the ordering
or possible interleaving of uppercase characters and lowercase characters is
implementation-defined.

In general, the best way to see if an item is a lower-case character is to
use the predicate lower-case-p.  The same statement applies for other
well-defined character categories for which the standard has provided us
handy (if not always so easy to remember) predicates.
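
As a hedged sketch of what that looks like for the function in the original
question: the name describe-char is invented for this example. Note that
lower-case-p may also be true for lowercase letters outside a..z in
implementations with larger character repertoires, so the sketch adds
standard-char-p to restrict the match to the standard characters.

```lisp
;; Illustrative only: DESCRIBE-CHAR is a made-up name, not from the thread.
(defun describe-char (c)
  (cond ((char= c #\_) "underscore")
        ((char= c #\-) "hyphen")
        ;; STANDARD-CHAR-P limits the match to the 96 standard characters,
        ;; whose only lowercase letters are a..z.
        ((and (standard-char-p c) (lower-case-p c)) "asdf")
        ;; mimic ECASE by signalling an error when nothing matches
        (t (error "~S fell through DESCRIBE-CHAR." c))))
```

Dropping the standard-char-p test gives "any lowercase letter the
implementation knows about", which may or may not be what is wanted.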

Now, all that being said, given that the ISO Western character set (sorry, I
still don't remember the spec number) is a superset of 7-bit ASCII, and that
most implementations use this character code set as their base character
set, as a first order approximation, comparing the range with char< (or its
brethren) is usually safe :-).

faa

P.S.  Is it just me, or does anyone else think that the entire character set
issue has not been very well thought out with respect to programming in any
language?

From: Johann Hibschman
Subject: Re: character classes
Date: Sat, 24 Feb 2001 20:43:31 +0000
Message-ID: <mtk86fzua4.fsf@astron.berkeley.edu>

Erik Naggum writes:

>   I'd like Common Lisp to have the first real solution to this problem.  I
>   have worked on this off and on for a very long time, scrapping more
>   designs than I feel comfortable enumerating.  Breaking out of the notions
>   that are so hard-wired into our "modern" operating systems is very hard.

I've only started thinking about this, but could you give a brief
illustration of your state?  Site-dependent character sets, encodings,
and upper/lower mappings are simple enough, but variable-width
characters puzzle me.  By variable-width, I mean things such as accent
modifiers, which change the following character but are not characters
themselves, and the German ess-zet, which is a single character (B)
in lower-case, but two (SS) in upper.  (Pardon the lack of a real
character, but I have no idea how to generate it on my system.)

The separate concepts of byte-streams and true strings seem important.
If a string is an array of characters, length is no longer a good
concept; does it refer to the number of characters, or to the length
of the textual representation?  Clearly, there should be two separate
functions to decide this.

In addition, even the parsing of string literals is a hard thing.
Should "HEISSEN" be an array containing #\H #\E #\I #\SS #\E #\N, or
should the #\SS be #\S #\S instead?  I would argue that it depends on
the language encodings currently in operation, but that seems
difficult.

I'd be inclined to have a character be a conceptual character, with
all accents, state, and so on encoded within it.  But now I am
curious.  Where would I go to find out more about issues like this?

--Johann

-- 
Johann Hibschman                           ······@physics.berkeley.edu

From: Hartmann Schaffer
Subject: Re: character classes
Date: Sat, 24 Feb 2001 22:24:43 +0000
Message-ID: <slrn99g7al.25e.hs@paradise.nirvananet>

In article <··············@astron.berkeley.edu>, Johann Hibschman wrote:
> ...
>In addition, even the parsing of string literals is a hard thing.
>Should "HEISSEN" be an array containing #\H #\E #\I #\SS #\E #\N, or
>should the #\SS be #\S #\S instead?  I would argue that it depends on
>the language encodings currently in operation, but that seems
>difficult.

wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
vs. "die heissen oefen")?

hs

From: Johann Hibschman
Subject: Re: character classes
Date: Sun, 25 Feb 2001 00:02:27 +0000
Message-ID: <mtk86ftyss.fsf@astron.berkeley.edu>

Hartmann Schaffer writes:

> In article <··············@astron.berkeley.edu>, Johann Hibschman wrote:

> wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> vs. "die heissen oefen")?

Good point.  I think you're right, but my German is very rusty.  Hm.

-- 
Johann Hibschman                           ······@physics.berkeley.edu

From: Stig Hemmer
Subject: Re: character classes
Date: Sun, 25 Feb 2001 12:46:58 +0000
Message-ID: <ekvhf1jj5fh.fsf@epoksy.pvv.ntnu.no>

··@paradise.nirvananet (Hartmann Schaffer) writes:
> wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> vs. "die heissen oefen")?

Norwegian has similar issues with our letters "æøå" (ae ligature, o
with slash across, a with ring above) which are sometimes transcribed
as "ae", "oe" and "aa".  And of course these letter combinations can
occur in other words as well...

After many years with all sorts of square wheels reinvented all over
the place, I think the only conclusion is that you should never try to
make the computer handle this in a "smart" manner.  It will always fail.

Stig Hemmer,
Jack of a Few Trades.

From: Johan Kullstam
Subject: Re: character classes
Date: Mon, 26 Feb 2001 04:42:31 +0000
Message-ID: <m24rxidpbk.fsf@euler.axel.nom>

Stig Hemmer <····@pvv.ntnu.no> writes:

> ··@paradise.nirvananet (Hartmann Schaffer) writes:
> > wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> > vs. "die heissen oefen")?
> 
> Norwegian has similar issues with our letters "æøå" (ae ligature, o
> with slash across, a with ring above) which is sometimes transscribed
> as "ae", "oe" and "aa".  And of course these letter combinations can
> occur in other words as well...

i don't know about norwegian, but swedes consider å ä and ö to be
distinct letters rather than modifications of roman letters.  the ae
oe and aa are just poor man's spellings for when you are stuck with a
limited set of characters.

i don't know what the official word is, but as i see it, the ß is
really two esses.  if you look closely (depending on the font) you can
see a long (integral sign type) s followed by the usual s.  it's a
ligature in the same way a font merges f and i to be a unit "fi" in
which the f hangs over and becomes the dot to the i.  correct me if i
am wrong, but ß isn't considered a distinct letter of its own.

Gauß and Gauss are the same person, and if you capitalize all the
letters, both become GAUSS.  it's kind of weird when capitalization
changes the number of characters you need.  i don't know how german
text would be best represented and manipulated given this quirk.  does
anyone have experience with this?

i don't see norwegian and swedish having this problem since capital æ
is Æ &c.  they all take up the same amount of room.  and if you're
stuck with ae then of course AE would be available.  i don't think
it's common to have æ but not Æ.

> After many years with all sorts of square wheels reinvented all over
> the place, I think the only conclusion is that you should never try to
> make the computer handle this is a "smart" manner.  It will always
> fail.

the other recourse is to abstraction.  sometimes WYSIWYG just isn't
enough.  there's more behind what you see than just what you can see.
you could have a logical representation which indicates double ss and
then a printing mechanism which renders it as ß or SS depending upon
capitalization.  it would presumably also keep distinct cases where
two ss just happen to come together but ß would be illegal.

-- 
J o h a n  K u l l s t a m
[········@ne.mediaone.net]
Don't Fear the Penguin!

From: Tim Bradshaw
Subject: Re: character classes
Date: Mon, 26 Feb 2001 10:51:41 +0000
Message-ID: <nkjk86dra2q.fsf@tfeb.org>

Johan Kullstam <········@ne.mediaone.net> writes:

> 
> i don't know what the official word is, but as i see it, the ß is
> really two esses.  if you look closely (depending on the font) you can
> see a long (integral sign type) s followed by the usual s.  it's a
> ligature in the same way a font merges f and i to be a unit "fi" in
> which the f hangs over and becomes the dot to the i.  correct me if i
> am wrong, but ß isn't considered a distinct letter of its own.
> 
> Gauß and Gauss are the same person, and if you capitalize all the
> letters, both become GAUSS.  it's kind of weird when capitalization
> changes the number of characters you need.  i don't know how german
> text would be best represented and manipulated given this quirk.  does
> anyone have experience with this?
> 

If we're talking about the German letter that looks a bit like a
`beta' in many typefaces (I can't make it on this terminal for some
reason), then I thought it was historically an s-z ligature, not an
s-s ligature.  Indeed it kind of looks like that -- long-s with a z
against it.  A native German speaker can probably correct me.

--tim

From: Dirk Bernhardt
Subject: Re: character classes
Date: Mon, 26 Feb 2001 19:28:31 +0000
Message-ID: <87wvad4528.fsf@krid.de>

Tim Bradshaw <···@tfeb.org> writes:

> If we're talking about the German letter that looks a bit like a
> `beta' in many typefaces (I can't make it on this terminal for some
> reason), then I thought it was historically an s-z ligature, not an
> s-s ligature.  Indeed it kind of looks like that -- long-s with a z
> against it.  A native German speaker can probably correct me.

That's true.  Unfortunately, the capitalization of `ß' is always `SS',
whereas the lowercase of `SS' is either `ss' or `ß'.  Without knowledge
of the word's semantics you cannot decide which lowercase
representation to use.
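
To make the asymmetry concrete, here is a rough sketch of the easy
direction only. The names *sharp-s* and upcase-german are invented for this
example, and it assumes an implementation whose character codes follow
Latin-1 or Unicode, so that code #xDF is the sharp s.

```lisp
;; Assumption: code #xDF is LATIN SMALL LETTER SHARP S (Latin-1 / Unicode).
(defparameter *sharp-s* (code-char #xDF))

(defun upcase-german (string)
  "Upcase STRING, expanding each sharp s to two characters.
The result can therefore be longer than the input."
  (with-output-to-string (out)
    (loop for ch across string
          do (if (char= ch *sharp-s*)
                 (write-string "SS" out)
                 (write-char (char-upcase ch) out)))))
```

Both (upcase-german "Gauss") and the same call on the spelling with a sharp
s give "GAUSS", so the information needed to go back has been lost; the
second input is also one character shorter than its result.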

Krid

From: Håkon Alstadheim
Subject: Re: character classes
Date: Mon, 26 Feb 2001 17:53:58 +0000
Message-ID: <m0ofvpmitl.fsf@alstadhome.cyberglobe.net>

Note: If emacs will stand by me, this message contains latin-1
characters coded according to iso-8859-1 , which is the de facto usenet
standard. The letters having char-code<128 coincide with ASCII.

Johan Kullstam <········@ne.mediaone.net> writes:

> Stig Hemmer <····@pvv.ntnu.no> writes:
>
> > ··@paradise.nirvananet (Hartmann Schaffer) writes:
> >
> > > wouldn't that depend on which "heissen" you are parsing ("wie
> > > heissen sie?" vs. "die heissen oefen")?
> > Norwegian has similar issues with our letters "æøå" (ae ligature,
> > o with slash across, a with ring above) which is sometimes
> > transscribed as "ae", "oe" and "aa". And of course these letter
> > combinations can occur in other words as well...
> 
> i don't know about norwegian, but swedes consider å ä and ö to be
> distinct letters rather than modifications of roman letters. the ae
> oe and aa are just poor man's spellings for when you are stuck with
> a limited set of characters.

Same in Norwegian, but the problems illustrate a point, namely that
ligatures that are more or less mandatory, or even proper letters in
their own right, have to be handled by people. They can not be dealt
with _correctly_ in a purely programmatic fashion. If one had an
exhaustive dictionary, one might come close, but no dictionary can
ever be exhaustive, because new words are coined all the time.

Example: If you write Norwegian without having the letter ø available,
you'd have:

- hoest for høst (autumn)
- moene for møne (top of peaked roof)
- Moen for Moen (the name, note no change)

Going the other way (reinserting the proper ø) would require human
intervention (possibly aided by dictionary), because the letters oe
appear as separate letters next to each other on occasion. A
dictionary alone would not be able to see the difference between møne
and moene ("the grasslands", sing.indef.:mo), so it would need to flag
collisions in some way for a human to resolve. These same
considerations hold for the German ß, vs. two separate s-es that
happen to be next to each other.
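
A small sketch of the "flag it for a human" idea: the function below only
lists every spelling that could be the intended one and decides nothing by
itself. The name restoration-candidates is invented here, and the default
letter assumes that code #xF8 is the o with slash, as in Latin-1 and
Unicode.

```lisp
;; Hypothetical helper. Assumes code #xF8 is LATIN SMALL LETTER O WITH
;; STROKE, which holds for Latin-1 and Unicode based implementations.
(defun restoration-candidates (word &optional (digraph "oe")
                                              (letter (code-char #xF8)))
  "Return every spelling of WORD in which each occurrence of DIGRAPH is
either kept as it is or replaced by LETTER."
  (let ((pos (search digraph word)))
    (if (null pos)
        (list word)
        (let ((head (subseq word 0 pos)))
          (append
           ;; keep this occurrence and continue one character later
           (mapcar (lambda (tail)
                     (concatenate 'string head (string (char word pos)) tail))
                   (restoration-candidates (subseq word (1+ pos))
                                           digraph letter))
           ;; replace this occurrence and continue after the digraph
           (mapcar (lambda (tail)
                     (concatenate 'string head (string letter) tail))
                   (restoration-candidates
                    (subseq word (+ pos (length digraph)))
                    digraph letter)))))))
```

For "moene" this returns two candidates, the unchanged word and the
spelling with the single letter. Both are real words, which is exactly the
collision a person would have to resolve.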

This is just to illustrate a point, since these "ligatures" are
mandatory, and have codes in the common local character encodings. It
becomes interesting when you consider the f-ligatures, that are not
mandatory (except at good quality typesetters) , and which don't exist
on any ordinary keyboard.

Should a computer language care about those ligatures? I'd say maybe.
The f-ligatures also need to be verified by a human, but it should be
possible to represent them in a string, and leave it up to the display
engine whether to show one or two glyphs. Making it possible to
represent the ligatures in a string is one of the considerations when
designing the system for representing strings in a computer-program.

-- 
Håkon Alstadheim, Montreal, Quebec, Canada

From: Hartmann Schaffer
Subject: Re: character classes
Date: Tue, 27 Feb 2001 00:29:53 +0000
Message-ID: <slrn99lp4h.5tr.hs@paradise.nirvananet>

In article <··············@alstadhome.cyberglobe.net>, 
Håkon Alstadheim wrote:
> ...
>> > > wouldn't that depend on which "heissen" you are parsing ("wie
>> > > heissen sie?" vs. "die heissen oefen")?
>dictionary alone would not be able to see the difference between møne
>and moene ("the grasslands", sing.indef.:mo), so it would need to flag
>collisions in some way for a human to resolve. These same
>considerations hold for the German ß, vs. two separate s-es that
>happen to be next to each other.

actually, as far as i remember (it has been too long), the "heissen"
examples are somewhat different.  in one case the "s-z" is mandatory (in
case you have it on your keyboard), in the other case it is an error.
and in the second case it is not an accidental juxtaposition.

> ...
>The f-ligatures also need to be verified by a human, but it should be
>possible to represent them in a string, and leave it up to the display
>engine whether to show one or two glyphs. Making it possible to
>represent the ligatures in a string is one of the considerations when
>designing the system for representing strings in a computer-program.

why?  this is simply a typographic convention that best be left to type-
setting programs.

hs

From: Thomas F. Burdick
Subject: Re: character classes
Date: Tue, 27 Feb 2001 01:13:43 +0000
Message-ID: <xcvhf1hvsfs.fsf@tempest.OCF.Berkeley.EDU>

··@paradise.nirvananet (Hartmann Schaffer) writes:

> >The f-ligatures also need to be verified by a human, but it should be
> >possible to represent them in a string, and leave it up to the display
> >engine whether to show one or two glyphs. Making it possible to
> >represent the ligatures in a string is one of the considerations when
> >designing the system for representing strings in a computer-program.
> 
> why?  this is simply a typographic convention that best be left to type-
> setting programs.

Actually, this is a more complicated problem than most people assume.
Not all instances of "ff", "fi", etc., should be ligated.  I can't
think of an example word off the top of my head, but this is mentioned
in the TeXBook, along with how to prevent automatic ligation
("affable" v. "af{}fable", I believe).  If it's possible for a program
to determine when f-ligatures are appropriate and when they aren't, I
don't think I've seen it implemented.  So it would actually seem (to
me anyway) to be useful to have both an "f" and an "ff" glyph -- then
display engines that do ff ligation could display the ligated form
when appropriate, and two f's when appropriate.  Display engines that
don't care could render both "ff" and two "f"s the same.
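
For what it is worth, a hedged sketch of the "engine that doesn't care"
side, assuming a Unicode-based implementation: Unicode has ligature
characters at code points U+FB00 (ff), U+FB01 (fi) and U+FB02 (fl), and the
made-up function expand-ligatures below simply maps them back to plain
letters. The table name *f-ligatures* is also invented for this example.

```lisp
;; Assumption: CODE-CHAR accepts Unicode code points in this implementation.
(defparameter *f-ligatures*
  (list (cons (code-char #xFB00) "ff")
        (cons (code-char #xFB01) "fi")
        (cons (code-char #xFB02) "fl"))
  "Ligature characters paired with the letters they stand for.")

(defun expand-ligatures (string)
  "Return a copy of STRING with each known ligature character spelled out."
  (with-output-to-string (out)
    (loop for ch across string
          for expansion = (cdr (assoc ch *f-ligatures*))
          do (if expansion
                 (write-string expansion out)
                 (write-char ch out)))))
```

A string that stores the ligature explicitly then records where ligation is
wanted, and a renderer without ligature support calls expand-ligatures
before drawing.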

From: Vladimir V. Zolotych
Subject: Re: character classes
Date: Sat, 24 Feb 2001 16:23:52 +0000
Message-ID: <3A97E018.43788B8A@eurocom.od.ua>

Here are three versions

I.

(defun foo (c)
  (ecase c
    (#\_ "underscore")
    (#\- "hyphen")
    ;; #. evaluates the LOOP at read time, so ECASE sees a literal key list
    (#.(loop for i from (char-code #\A) to (char-code #\z)
             collect (code-char i)) "asdf")))
(Thanks to Paul Foley. I would never have guessed about #. myself.)

II.

(deftype underscore () `(eql #\_))
(deftype hyphen () `(eql #\-))
(deftype letter ()
  `(member ,@(loop for i from (char-code #\A) to (char-code #\z)
                   collect (code-char i))))

(defun foo2 (c)
  (etypecase c
    (underscore "underscore")
    (hyphen "hyphen")
    (letter "letter")))

III.

(defun foo3 (c)
  (cond ((char= c #\_) "underscore")
        ((char= c #\-) "hyphen")
        ((member c (loop for i from (char-code #\A) to (char-code #\z)
                         collecting (code-char i))) "letter")))

Which of the above would be considered good Lisp style?
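
One caveat about the range from #\A to #\z, sketched under the assumption
of an ASCII-based implementation: that range also covers six characters
that are not letters, the underscore among them. The helper names
char-range and non-letters-in-range are invented for this check.

```lisp
;; Hypothetical helpers for inspecting a code range.
(defun char-range (from to)
  "Return the characters whose codes run from FROM's to TO's, inclusive."
  (loop for code from (char-code from) to (char-code to)
        collect (code-char code)))

(defun non-letters-in-range (from to)
  "Return the characters between FROM and TO that are not alphabetic."
  (remove-if #'alpha-char-p (char-range from to)))
```

On an ASCII-based implementation (non-letters-in-range #\A #\z) returns
(#\[ #\\ #\] #\^ #\_ #\`), while (non-letters-in-range #\a #\z) returns NIL.
Building the list from two per-case ranges, for example
(append (char-range #\a #\z) (char-range #\A #\Z)), avoids the extras.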

Another question: to compare two chars I have at least three
options:
  char=
  eq
  eql
It seems eql is the least efficient. What is the difference between
char= and eq for comparing chars? Should the most specific
comparison be chosen in all cases?
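
A hedged note on the comparison question. As far as the standard goes, eq
is not guaranteed to be true for two characters that are the same, so it is
not portable for this purpose. eql is defined to work on characters, and it
is the test that case and ecase use. char= accepts only characters, and
char-equal additionally ignores case. The function name char-comparisons
below is made up; it just shows the portable tests side by side.

```lisp
;; Illustrative only: collects the results of the portable comparisons.
;; EQ is left out on purpose because its result on characters is
;; implementation-dependent.
(defun char-comparisons (a b)
  "Return a plist showing how each comparison judges characters A and B."
  (list :eql (eql a b)
        :char= (char= a b)
        :char-equal (char-equal a b)))
```

For example (char-comparisons #\a #\A) returns
(:EQL NIL :CHAR= NIL :CHAR-EQUAL T).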

-- 
Vladimir Zolotych                         ······@eurocom.od.ua