From: Pascal Bourguignon
Subject: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <873bsmyodc.fsf@thalassa.informatimago.com>
As we all know, CLHS is not very formal, but would you say that
implementations should keep this invariant:

(for-all ch in character
  (or (not (digit-char-p ch))
      (= (digit-char-p ch) (read-from-string (string ch)))))


e.g.:

(loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))



All but clisp keep this invariant. 

But then, only clisp's digit-char-p returns true for non-ASCII digits,
which I find useful; I only lament the breaking of this invariant...



[···@thalassa encours]$ sbcl --sysinit /dev/null --userinit /dev/null
* (loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))

NIL
* (sb-ext:quit)
[···@thalassa encours]$ cmucl -noinit 
CMU Common Lisp 18e-pre2 2003-03-25-003, running on thalassa
With core: /local/languages/cmucl-18e/lib/cmucl/lib/lisp.core
Dumped on: Wed, 2003-03-26 00:53:00+01:00 on orion
See <http://www.cons.org/cmucl/> for support information.
Loaded subsystems:
    Python 1.1, target Intel x86
    CLOS 18e (based on PCL September 16 92 PCL (f))
* (loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))

NIL
* (ext:quit)
[···@thalassa encours]$ /local/src/ecl-0.9/build/ecl
ECL (Embeddable Common-Lisp) 0.9
Copyright (C) 1984 Taiichi Yuasa and Masami Hagiya
Copyright (C) 1993 Giuseppe Attardi
Copyright (C) 2000 Juan J. Garcia-Ripoll
	ECL is free software, and you are welcome to redistribute it
under certain conditions; see file 'Copyright' for details.
Type :h for Help.  Top level.
> (loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))
NIL
> (quit)
[···@thalassa encours]$ gcl
GCL (GNU Common Lisp)  2.6.3 CLtL1   Jul 16 2004 20:19:02
Source License: LGPL(gcl,gmp), GPL(unexec,bfd)
Binary License:  GPL due to GPL'ed components: (READLINE BFD UNEXEC)
Modifications of this banner must retain notice of a compatible license
Dedicated to the memory of W. Schelter

Use (help) to get some basic information on how to use GCL.

>(loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))

NIL

>(quit)
[···@thalassa encours]$ clisp -q -norc
[1]> (loop for i below char-code-limit
      for ch = (code-char i)
      do (assert (or (not (digit-char-p ch))
                     (eql (digit-char-p ch) (read-from-string (string ch))))
                 (ch)))

** - Continuable Error
(OR (NOT (DIGIT-CHAR-P CH))
     (EQL (DIGIT-CHAR-P CH) (READ-FROM-STRING (STRING CH)))) must evaluate to a
     non-NIL value.
If you continue (by typing 'continue'): You may input a new value for CH.
The following restarts are also available:
ABORT          :R1      ABORT
Break 1 [2]> (ext:quit)
[···@thalassa encours]$ 

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

The world will now reboot.  don't bother saving your artefacts.

From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uvf5hyl4u.fsf@gnu.org>
> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 15:26:55 +0200]:
>
> As we all know, CLHS is not very formal, but would you say that
> implementations should keep this invariant:
>
> (for-all ch in character
>   (or (not (digit-char-p ch))
>       (= (digit-char-p ch) (read-from-string (string ch)))))
>
>
> eg.:
>
> (loop for i below char-code-limit
>       for ch = (code-char i)
>       do (assert (or (not (digit-char-p ch))
>                      (eql (digit-char-p ch) (read-from-string (string ch))))
>                  (ch)))
>
>
>
> All but clisp keep this invariant. 
>
> But then, only clisp's digit-char-p returns true for non-ASCII digits,
> which I find useful; I only lament the breaking of this invariant...

Specifically,

(loop for i below char-code-limit
   for ch = (code-char i)
   unless (or (not (digit-char-p ch))
              (eql (digit-char-p ch) (read-from-string (string ch))))
   collect ch)
==>
(#\ARABIC-INDIC_DIGIT_ZERO #\ARABIC-INDIC_DIGIT_ONE #\ARABIC-INDIC_DIGIT_TWO
 #\ARABIC-INDIC_DIGIT_THREE #\ARABIC-INDIC_DIGIT_FOUR #\ARABIC-INDIC_DIGIT_FIVE
 #\ARABIC-INDIC_DIGIT_SIX #\ARABIC-INDIC_DIGIT_SEVEN #\ARABIC-INDIC_DIGIT_EIGHT
 #\ARABIC-INDIC_DIGIT_NINE #\EXTENDED_ARABIC-INDIC_DIGIT_ZERO
 #\EXTENDED_ARABIC-INDIC_DIGIT_ONE #\EXTENDED_ARABIC-INDIC_DIGIT_TWO
 #\EXTENDED_ARABIC-INDIC_DIGIT_THREE #\EXTENDED_ARABIC-INDIC_DIGIT_FOUR
 #\EXTENDED_ARABIC-INDIC_DIGIT_FIVE #\EXTENDED_ARABIC-INDIC_DIGIT_SIX
 #\EXTENDED_ARABIC-INDIC_DIGIT_SEVEN #\EXTENDED_ARABIC-INDIC_DIGIT_EIGHT
 #\EXTENDED_ARABIC-INDIC_DIGIT_NINE #\DEVANAGARI_DIGIT_ZERO
 #\DEVANAGARI_DIGIT_ONE #\DEVANAGARI_DIGIT_TWO #\DEVANAGARI_DIGIT_THREE
 #\DEVANAGARI_DIGIT_FOUR #\DEVANAGARI_DIGIT_FIVE #\DEVANAGARI_DIGIT_SIX
 #\DEVANAGARI_DIGIT_SEVEN #\DEVANAGARI_DIGIT_EIGHT #\DEVANAGARI_DIGIT_NINE
 #\BENGALI_DIGIT_ZERO #\BENGALI_DIGIT_ONE #\BENGALI_DIGIT_TWO
 #\BENGALI_DIGIT_THREE #\BENGALI_DIGIT_FOUR #\BENGALI_DIGIT_FIVE
 #\BENGALI_DIGIT_SIX #\BENGALI_DIGIT_SEVEN #\BENGALI_DIGIT_EIGHT
 #\BENGALI_DIGIT_NINE #\GURMUKHI_DIGIT_ZERO #\GURMUKHI_DIGIT_ONE
 #\GURMUKHI_DIGIT_TWO #\GURMUKHI_DIGIT_THREE #\GURMUKHI_DIGIT_FOUR
 #\GURMUKHI_DIGIT_FIVE #\GURMUKHI_DIGIT_SIX #\GURMUKHI_DIGIT_SEVEN
 #\GURMUKHI_DIGIT_EIGHT #\GURMUKHI_DIGIT_NINE #\GUJARATI_DIGIT_ZERO
 #\GUJARATI_DIGIT_ONE #\GUJARATI_DIGIT_TWO #\GUJARATI_DIGIT_THREE
 #\GUJARATI_DIGIT_FOUR #\GUJARATI_DIGIT_FIVE #\GUJARATI_DIGIT_SIX
 #\GUJARATI_DIGIT_SEVEN #\GUJARATI_DIGIT_EIGHT #\GUJARATI_DIGIT_NINE
 #\ORIYA_DIGIT_ZERO #\ORIYA_DIGIT_ONE #\ORIYA_DIGIT_TWO #\ORIYA_DIGIT_THREE
 #\ORIYA_DIGIT_FOUR #\ORIYA_DIGIT_FIVE #\ORIYA_DIGIT_SIX #\ORIYA_DIGIT_SEVEN
 #\ORIYA_DIGIT_EIGHT #\ORIYA_DIGIT_NINE #\TAMIL_DIGIT_ONE #\TAMIL_DIGIT_TWO
 #\TAMIL_DIGIT_THREE #\TAMIL_DIGIT_FOUR #\TAMIL_DIGIT_FIVE #\TAMIL_DIGIT_SIX
 #\TAMIL_DIGIT_SEVEN #\TAMIL_DIGIT_EIGHT #\TAMIL_DIGIT_NINE #\TELUGU_DIGIT_ZERO
 #\TELUGU_DIGIT_ONE #\TELUGU_DIGIT_TWO #\TELUGU_DIGIT_THREE #\TELUGU_DIGIT_FOUR
 #\TELUGU_DIGIT_FIVE #\TELUGU_DIGIT_SIX #\TELUGU_DIGIT_SEVEN
 #\TELUGU_DIGIT_EIGHT #\TELUGU_DIGIT_NINE #\KANNADA_DIGIT_ZERO
 #\KANNADA_DIGIT_ONE #\KANNADA_DIGIT_TWO #\KANNADA_DIGIT_THREE
 #\KANNADA_DIGIT_FOUR #\KANNADA_DIGIT_FIVE #\KANNADA_DIGIT_SIX
 #\KANNADA_DIGIT_SEVEN #\KANNADA_DIGIT_EIGHT #\KANNADA_DIGIT_NINE
 #\MALAYALAM_DIGIT_ZERO #\MALAYALAM_DIGIT_ONE #\MALAYALAM_DIGIT_TWO
 #\MALAYALAM_DIGIT_THREE #\MALAYALAM_DIGIT_FOUR #\MALAYALAM_DIGIT_FIVE
 #\MALAYALAM_DIGIT_SIX #\MALAYALAM_DIGIT_SEVEN #\MALAYALAM_DIGIT_EIGHT
 #\MALAYALAM_DIGIT_NINE #\THAI_DIGIT_ZERO #\THAI_DIGIT_ONE #\THAI_DIGIT_TWO
 #\THAI_DIGIT_THREE #\THAI_DIGIT_FOUR #\THAI_DIGIT_FIVE #\THAI_DIGIT_SIX
 #\THAI_DIGIT_SEVEN #\THAI_DIGIT_EIGHT #\THAI_DIGIT_NINE #\LAO_DIGIT_ZERO
 #\LAO_DIGIT_ONE #\LAO_DIGIT_TWO #\LAO_DIGIT_THREE #\LAO_DIGIT_FOUR
 #\LAO_DIGIT_FIVE #\LAO_DIGIT_SIX #\LAO_DIGIT_SEVEN #\LAO_DIGIT_EIGHT
 #\LAO_DIGIT_NINE #\TIBETAN_DIGIT_ZERO #\TIBETAN_DIGIT_ONE #\TIBETAN_DIGIT_TWO
 #\TIBETAN_DIGIT_THREE #\TIBETAN_DIGIT_FOUR #\TIBETAN_DIGIT_FIVE
 #\TIBETAN_DIGIT_SIX #\TIBETAN_DIGIT_SEVEN #\TIBETAN_DIGIT_EIGHT
 #\TIBETAN_DIGIT_NINE #\MYANMAR_DIGIT_ZERO #\MYANMAR_DIGIT_ONE
 #\MYANMAR_DIGIT_TWO #\MYANMAR_DIGIT_THREE #\MYANMAR_DIGIT_FOUR
 #\MYANMAR_DIGIT_FIVE #\MYANMAR_DIGIT_SIX #\MYANMAR_DIGIT_SEVEN
 #\MYANMAR_DIGIT_EIGHT #\MYANMAR_DIGIT_NINE #\ETHIOPIC_DIGIT_ONE
 #\ETHIOPIC_DIGIT_TWO #\ETHIOPIC_DIGIT_THREE #\ETHIOPIC_DIGIT_FOUR
 #\ETHIOPIC_DIGIT_FIVE #\ETHIOPIC_DIGIT_SIX #\ETHIOPIC_DIGIT_SEVEN
 #\ETHIOPIC_DIGIT_EIGHT #\ETHIOPIC_DIGIT_NINE #\KHMER_DIGIT_ZERO
 #\KHMER_DIGIT_ONE #\KHMER_DIGIT_TWO #\KHMER_DIGIT_THREE #\KHMER_DIGIT_FOUR
 #\KHMER_DIGIT_FIVE #\KHMER_DIGIT_SIX #\KHMER_DIGIT_SEVEN #\KHMER_DIGIT_EIGHT
 #\KHMER_DIGIT_NINE #\MONGOLIAN_DIGIT_ZERO #\MONGOLIAN_DIGIT_ONE
 #\MONGOLIAN_DIGIT_TWO #\MONGOLIAN_DIGIT_THREE #\MONGOLIAN_DIGIT_FOUR
 #\MONGOLIAN_DIGIT_FIVE #\MONGOLIAN_DIGIT_SIX #\MONGOLIAN_DIGIT_SEVEN
 #\MONGOLIAN_DIGIT_EIGHT #\MONGOLIAN_DIGIT_NINE #\FULLWIDTH_DIGIT_ZERO
 #\FULLWIDTH_DIGIT_ONE #\FULLWIDTH_DIGIT_TWO #\FULLWIDTH_DIGIT_THREE
 #\FULLWIDTH_DIGIT_FOUR #\FULLWIDTH_DIGIT_FIVE #\FULLWIDTH_DIGIT_SIX
 #\FULLWIDTH_DIGIT_SEVEN #\FULLWIDTH_DIGIT_EIGHT #\FULLWIDTH_DIGIT_NINE
 #\MATHEMATICAL_BOLD_DIGIT_ZERO #\MATHEMATICAL_BOLD_DIGIT_ONE
 #\MATHEMATICAL_BOLD_DIGIT_TWO #\MATHEMATICAL_BOLD_DIGIT_THREE
 #\MATHEMATICAL_BOLD_DIGIT_FOUR #\MATHEMATICAL_BOLD_DIGIT_FIVE
 #\MATHEMATICAL_BOLD_DIGIT_SIX #\MATHEMATICAL_BOLD_DIGIT_SEVEN
 #\MATHEMATICAL_BOLD_DIGIT_EIGHT #\MATHEMATICAL_BOLD_DIGIT_NINE
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_ZERO #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_ONE
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_TWO
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_THREE
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_FOUR
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_FIVE #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_SIX
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_SEVEN
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_EIGHT
 #\MATHEMATICAL_DOUBLE-STRUCK_DIGIT_NINE #\MATHEMATICAL_SANS-SERIF_DIGIT_ZERO
 #\MATHEMATICAL_SANS-SERIF_DIGIT_ONE #\MATHEMATICAL_SANS-SERIF_DIGIT_TWO
 #\MATHEMATICAL_SANS-SERIF_DIGIT_THREE #\MATHEMATICAL_SANS-SERIF_DIGIT_FOUR
 #\MATHEMATICAL_SANS-SERIF_DIGIT_FIVE #\MATHEMATICAL_SANS-SERIF_DIGIT_SIX
 #\MATHEMATICAL_SANS-SERIF_DIGIT_SEVEN #\MATHEMATICAL_SANS-SERIF_DIGIT_EIGHT
 #\MATHEMATICAL_SANS-SERIF_DIGIT_NINE #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_ZERO
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_ONE
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_TWO
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_THREE
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_FOUR
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_FIVE
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_SIX
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_SEVEN
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_EIGHT
 #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_NINE #\MATHEMATICAL_MONOSPACE_DIGIT_ZERO
 #\MATHEMATICAL_MONOSPACE_DIGIT_ONE #\MATHEMATICAL_MONOSPACE_DIGIT_TWO
 #\MATHEMATICAL_MONOSPACE_DIGIT_THREE #\MATHEMATICAL_MONOSPACE_DIGIT_FOUR
 #\MATHEMATICAL_MONOSPACE_DIGIT_FIVE #\MATHEMATICAL_MONOSPACE_DIGIT_SIX
 #\MATHEMATICAL_MONOSPACE_DIGIT_SEVEN #\MATHEMATICAL_MONOSPACE_DIGIT_EIGHT
 #\MATHEMATICAL_MONOSPACE_DIGIT_NINE)

(238 digits)

These are Unicode characters that have the "digit" Unicode attribute.

CLTS:

      digit n. (in a radix) a character that is among the possible
      digits (0 to 9, A to Z, and a to z) and that is defined to have an
      associated numeric weight as a digit in that radix. See Section
      13.1.4.6 (Digits in a Radix).

<http://www.lisp.org/HyperSpec/Body/sec_13-1-4-6.html> appears to be
fairly specific: only the standard ASCII characters are potential
digits.  Therefore the Unicode characters with the digit attribute are
numeric characters but not digits.

Suppose we try to keep Pascal's invariant in all radixes.

#\MATHEMATICAL_BOLD_DIGIT_NINE is 9 in radix 16.
How about #\MATHEMATICAL_BOLD_CAPITAL_A in radix 16?
It would seem reasonable to expect it to be 10.
(yes, we can use the "character decomposition" to map
#\MATHEMATICAL_BOLD_CAPITAL_A to #\A, so, it is possible to arrange that
   (let ((*read-base* 16))
     (read-from-string
      (concatenate 'string
        (string #\MATHEMATICAL_BOLD_CAPITAL_A)
        (string #\MATHEMATICAL_BOLD_DIGIT_NINE))))
returns #xA9).

Now, how about the first letter of the ETHIOPIC alphabet?
How about alphabets with fewer than 26 letters?
More than 26 letters?  - why not extend the notion of a radix?

Then, how about #\ETHIOPIC_NUMBER_TEN?
#\ETHIOPIC_NUMBER_THIRTY?
#\ETHIOPIC_NUMBER_HUNDRED?
What should
     (read-from-string
      (concatenate 'string
        (string #\ETHIOPIC_NUMBER_HUNDRED)
        (string #\ETHIOPIC_NUMBER_TEN)
        (string #\ETHIOPIC_DIGIT_NINE)))
return? 119? 100109?  (the Ethiopic system is not positional).

Or even funnier:
should

     (read-from-string
      (concatenate 'string
        (string #\ARABIC-INDIC_DIGIT_ONE)
        (string #\DEVANAGARI_DIGIT_TWO)
        (string #\BENGALI_DIGIT_THREE)
        (string #\GUJARATI_DIGIT_FOUR)
        (string #\TAMIL_DIGIT_FIVE)))

return 12345?

My points are:

1. Pascal's invariant is not required by the CLTS.

2. Requiring Pascal's invariant would produce weird results in
   implementations that use Unicode.

3. The current situation in CLISP allows users to parse Unicode text
   (possibly interpreting numbers &c) by using Unicode attributes
   because DIGIT-CHAR-P returns useful values for Unicode digits.
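
[Editorial sketch, not part of the original post.]  Point 3 can be
illustrated with a small helper that folds the weights returned by
DIGIT-CHAR-P into an integer without going through the reader.  The
name DIGITS-TO-INTEGER is hypothetical.  Whether non-ASCII digits are
accepted depends entirely on what the implementation's DIGIT-CHAR-P
returns for them; this code makes no assumption either way.

```lisp
;; Hypothetical helper: build an integer from the DIGIT-CHAR-P weights of
;; every character in STRING.  Returns NIL for an empty string or as soon
;; as a character is not a digit in RADIX.
(defun digits-to-integer (string &optional (radix 10))
  (let ((value 0))
    (and (plusp (length string))
         (every (lambda (ch)
                  (let ((weight (digit-char-p ch radix)))
                    (when weight
                      (setf value (+ (* value radix) weight))
                      t)))
                string)
         value)))

;; (digits-to-integer "12345")  ; => 12345
;; (digits-to-integer "ff" 16)  ; => 255
;; (digits-to-integer "12x")    ; => NIL
```

Note that this helper would happily combine digits from different
scripts if DIGIT-CHAR-P accepts them, which is exactly the
mixed-script question raised above.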

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.memri.org/> <http://www.camera.org> <http://ffii.org/>
<http://www.palestinefacts.org/> <http://pmw.org.il/>
Who is General Failure and why is he reading my hard disk?
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116345885.036479.229380@g47g2000cwa.googlegroups.com>
Sam Steingold wrote:
> > * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 15:26:55 +0200]:
> >
> > As we all know, CLHS is not very formal, but would you say that
> > implementations should keep this invariant:
> >
> > (for-all ch in character
> >   (or (not (digit-char-p ch))
> >       (= (digit-char-p ch) (read-from-string (string ch)))))
> >

> >
> >
> > All but clisp keep this invariant.
> >
> > But then, only clisp's digit-char-p returns true for non-ASCII digits,
> > which I find useful; I only lament the breaking of this invariant...
>
> Specifically,
>
> (loop for i below char-code-limit
>    for ch = (code-char i)
>    unless (or (not (digit-char-p ch))
>               (eql (digit-char-p ch) (read-from-string (string ch))))
>    collect ch)
> ==>

>
> Suppose we try to keep Pascal's invariant in all radixes.
>
> #\MATHEMATICAL_BOLD_DIGIT_NINE is 9 in radix 16.
> How about #\MATHEMATICAL_BOLD_CAPITAL_A in radix 16?
> It would seem reasonable to expect it to be 10.
> (yes, we can use the "character decomposition" to map
> #\MATHEMATICAL_BOLD_CAPITAL_A to #\A, so, it is possible to arrange that
>    (let ((*read-base* 16))
>      (read-from-string
>       (concatenate 'string
>         (string #\MATHEMATICAL_BOLD_CAPITAL_A)
>         (string #\MATHEMATICAL_BOLD_DIGIT_NINE))))
> returns #xA9).
>

The somewhat-serious problem is exactly the converse of the one you
state.

There is no problem with reading #\A and having the value disagree,
because the proposed invariant involves only digit-char-p. If
(digit-char-p #\A) is false, then the value given by reading it in does
not matter.

However, if the radix is LESS THAN 10 (decimal), then we have a
problem. The digits >= *read-base* will not be read in numerically, but
rather as a symbol.

(let ((*read-base* 8)) (read-from-string "8")) ==> |8|

(let ((*read-base* 16)) (read-from-string "8")) ==> 8

I should make clear that I call this issue only somewhat-serious,
because it could be patched by making the invariant involve
(digit-char-p ch *read-base*).

(let ((*read-base* 16)) (read-from-string "A" *read-base*)) ==> 10
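
[Editorial sketch, not part of the original post.]  The patched
invariant, with the radix passed to DIGIT-CHAR-P and bound to
*READ-BASE*, could be checked along these lines.  The function name
DIGIT-READER-MISMATCHES is hypothetical.  The sketch also skips codes
for which CODE-CHAR returns NIL and treats a reader error as a
mismatch; both are defensive assumptions rather than anything stated
in the thread.

```lisp
;; Hypothetical checker: collect every character whose DIGIT-CHAR-P weight
;; in RADIX differs from what the reader returns for that single character
;; when *READ-BASE* is bound to the same RADIX.
(defun digit-reader-mismatches (&optional (radix 10))
  (let ((*read-base* radix)
        (*read-eval* nil))
    (loop for code below char-code-limit
          for ch = (code-char code)          ; may be NIL for some codes
          for weight = (and ch (digit-char-p ch radix))
          when (and weight
                    (not (eql weight
                              (ignore-errors
                               (read-from-string (string ch))))))
            collect ch)))
```

Calling (digit-reader-mismatches 8) or (digit-reader-mismatches 16)
lists the offending characters for that radix.  The result is
implementation-dependent: it should contain no standard characters,
and it is non-empty only where DIGIT-CHAR-P accepts characters that
the reader does not treat as digits.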

I think it might make sense to have all characters which are recognized
as digit-char-p also result in numbers when the reader sees them, as
long as the radix allows them to be read as numbers. That's clearly the
reasonable use for digit-char-p. Any parser that asks for digit-char-p
is almost certainly going to rely on the returned value to calculate
the represented value. If CLISP went to the trouble of specifying the
numerical values for digit-char-p to return (and I assume it chose
reasonably?) why shouldn't the Lisp reader agree with that numerical
value?

Otherwise, users of digit-char-p, if they want to store the original
strings, while preserving their numeric values, will have to sterilize
them by replacing digits with ASCII equivalents if they will ever be
seen by the reader. Perhaps this is good practice anyway, because other
programs/platforms/Lisp implementations might not be as inclusive.

Given that reality, the CLISP feature of being inclusive with
digit-char-p seems to introduce risk without a clear benefit. It seems
more likely that Unicode digits will be generated by accident than by
intent.

However, I have zero experience with this kind of
internationalization/character-coding issue, so I have no way to gauge
the true impact.
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <ufywlygeg.fsf@gnu.org>
> * ··············@hotmail.com <············@tznvy.pbz> [2005-05-17 09:04:45 -0700]:
>
> I think it might make sense to have all characters which are
> recognized as digit-char-p also result in numbers when the reader sees
> them, as long as the radix allows them to be read as numbers.

this is not what CLTS says.

I repeat:
what should

     (read-from-string
      (concatenate 'string
        (string #\ARABIC-INDIC_DIGIT_ONE)
        (string #\DEVANAGARI_DIGIT_TWO)
        (string #\BENGALI_DIGIT_THREE)
        (string #\GUJARATI_DIGIT_FOUR)
        (string #\TAMIL_DIGIT_FIVE)))

return?
Do you seriously expect such a string to mean 12345?


> Given that reality, the CLISP feature of being inclusive with
> digit-char-p seems to introduce risk without a clear benefit. It seems
> more likely that Unicode digits will be generated by accident than by
> intent.

I don't see the risk.
the benefit is that the user will be able to parse Unicode text and
decide what is a word and what is a number - and maybe implement a NLP
parser.

people who try to use CL _reader_ to do NLP will lose anyway, regardless
of whether the text is ASCII or UNICODE or whether Unicode digits are
read as numbers by READ.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.honestreporting.com> <http://www.dhimmi.com/>
<http://www.memri.org/> <http://www.mideasttruth.com/> <http://pmw.org.il/>
cogito cogito ergo cogito sum
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116348974.428855.96620@g49g2000cwa.googlegroups.com>
Sam Steingold wrote:
> > * ··············@hotmail.com <············@tznvy.pbz> [2005-05-17 09:04:45 -0700]:
> >
> > I think it might make sense to have all characters which are
> > recognized as digit-char-p also result in numbers when the reader sees
> > them, as long as the radix allows them to be read as numbers.
>
> this is not what CLTS says.

I don't dispute that.

> I repeat:
> what should
>
>      (read-from-string
>       (concatenate 'string
>         (string #\ARABIC-INDIC_DIGIT_ONE)
>         (string #\DEVANAGARI_DIGIT_TWO)
>         (string #\BENGALI_DIGIT_THREE)
>         (string #\GUJARATI_DIGIT_FOUR)
>         (string #\TAMIL_DIGIT_FIVE)))
>
> return?
> Do you seriously expect such string to mean 12345?
>

Why not? How else would the producer of such a string mean it to be
interpreted? Is it not just as unreasonable for the producer to
deliberately intend such a Lisp symbol? You've deliberately chosen an
edge case, so who would reasonably *rely* on behavior either way?

> the benefit is that the user will be able to parse Unicode text and
> decide what is a word and what is a number - and maybe implement a NLP
> parser.

"What is a word" and "What is a number" is an application- or
domain-specific question, which cannot be answered in a language spec
or by an implementation. An implementation should do its best to stay
out of the way. Is a U.S. Postal ZIP code a number? Unfortunately, a
lot of tools try to answer "yes," screwing up royally when they see New
England ZIP codes which start in zero.

Anyway, in the original context, once digit-char-p starts declaring
things "numeric" there is always a danger such characters will get
treated in ways that a naive Lisp program might be trying to mimic
another program, and that other program may use the Lisp reader or
something similar. Instead of requiring every user of digit-char-p to
sterilize his data, it would be better for a truly Unicode-aware
application to explicitly decide how permissive to be about numeric
characters. "digit-char-p" is only one function---you can't expect it
to deal with every possibility that human typists or coders can
generate. Best expect it to only deal with the Lisp context; letting it
agree with the reader seems a reasonable (if admittedly non-conforming)
approach.

Common Lisp did not attempt to be Unicode-aware (only relatively
agnostic), and it probably cannot do so to some ultimate extreme and
still remain within the original standard. The current standard could
not possibly have cleanly specified (with Unicode in the future) where
the line should properly be drawn between 7-bit-ASCII-only and
all-Unicode-all-the-time.

> people who try to use CL _reader_ to do NLP will lose anyway, regardless
> of whether the text is ASCII or UNICODE or whether Unicode digits are
> read as numbers by READ.
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uwtpxwycz.fsf@gnu.org>
> * ··············@hotmail.com <············@tznvy.pbz> [2005-05-17
> 09:56:14 -0700]:
>
>> what should
>>
>>      (read-from-string
>>       (concatenate 'string
>>         (string #\ARABIC-INDIC_DIGIT_ONE)
>>         (string #\DEVANAGARI_DIGIT_TWO)
>>         (string #\BENGALI_DIGIT_THREE)
>>         (string #\GUJARATI_DIGIT_FOUR)
>>         (string #\TAMIL_DIGIT_FIVE)))
>>
>> return?
>> Do you seriously expect such string to mean 12345?
>>
>
> Why not? How else would the producer of such a string mean it to be
> interpreted? Is it not just as unreasonable for the producer to
> deliberately intend such a Lisp symbol? You've deliberately chosen an
> edge case, so who would reasonably *rely* on behavior either way?

CL reader parses as numbers things that "look like a number".
no one will look at the string above and say "yeah, that's a number".

> "What is a word" and "What is a number" is an application- or
> domain-specific question, which cannot be answered in a language spec
> or by an implementation.

this is precisely why the CL reader should _not_ interpret the above
string as a number: because CL reader operates in the CL domain where
the above is not a number as per the CL syntax.

OTOH, (DIGIT-CHAR-P #\DEVANAGARI_DIGIT_TWO) returning 2 is useful
because this is a domain-neutral issue of the nature of the Unicode
character in question.

> Anyway, in the original context, once digit-char-p starts declaring
> things "numeric" there is always a danger such characters will get
> treated in ways that a naive Lisp program might be trying to mimic
> another program, and that other program may use the Lisp reader or
> something similar.

Lisp reader is for Lisp data (including Lisp code).
[yes, it is extensible, but it is extensible to incorporate "Lisp-like"
data, not "every natural syntax you can imagine"; it is relatively easy
to make READ parse XML (CLOCC/CLLIB/xml.lisp), but not C]

There is no way to tell the CL reader to
print 2 as (string #\DEVANAGARI_DIGIT_TWO),
thus there is no reason to read 2 from (string #\DEVANAGARI_DIGIT_TWO).

I hope we all agree on this.

> Instead of requiring every user of digit-char-p to sterilize his data,

what do you mean?
if your data contains Unicode characters, you should know about Unicode.

In Unicode, #\DEVANAGARI_DIGIT_TWO is a digit, and its weight is 2.
[it's not like CLISP is searching for substrings "TWO" in character
names :-)]
This is the same level statement as "in CL, (CAR NIL) returns NIL".
If you do not like what the Unicode international standard says, don't
use Unicode, use ASCII (yes, you can build CLISP in ASCII mode).
If you do not like (CAR NIL) ==> NIL, don't use CL, use Scheme.



-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.iris.org.il> <http://www.dhimmi.com/> <http://www.camera.org>
<http://www.memri.org/> <http://www.palestinefacts.org/>
Those who value Life above Freedom are destined to lose both.
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87fywly3cr.fsf@thalassa.informatimago.com>
···············@hotmail.com" <············@gmail.com> writes:
> Anyway, in the original context, once digit-char-p starts declaring
> things "numeric" there is always a danger such characters will get
> treated in ways that a naive Lisp program might be trying to mimic
> another program, and that other program may use the Lisp reader or
> something similar. Instead of requiring every user of digit-char-p to
> sterilize his data, it would be better for a truly Unicode-aware
> application to explicitly decide how permissive to be about numeric
> characters. "digit-char-p" is only one function---you can't expect it
> to deal with every possibility that human typists or coders can
> generate. Best expect it to only deal with the Lisp context; letting it
> agree with the reader seems a reasonable (if admittedly non-conforming)
> approach.

Indeed, I would not be surprised to read (or would naturally write):
  
    (when (every (function digit-char-p) string)
      (assert (integerp (parse-integer string :junk-allowed nil))))


Of course, one could argue that the correct way would be to write:

   (assert (typep (handler-case (parse-integer string :junk-allowed nil)
                      (error () nil))
                  '(or null integer)))
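
[Editorial sketch, not part of the original post.]  The second form
can be packaged as a small helper that returns NIL instead of
signalling when the string is not an integer in the given radix.  The
name PARSE-INTEGER-OR-NIL is hypothetical.  It relies on PARSE-INTEGER
signalling an error of type PARSE-ERROR when :JUNK-ALLOWED is false,
as the standard specifies, and so does not depend on what
DIGIT-CHAR-P returns for non-ASCII characters.

```lisp
;; Hypothetical helper: parse STRING as an integer in RADIX, returning NIL
;; instead of signalling when PARSE-INTEGER rejects the input.
(defun parse-integer-or-nil (string &key (radix 10))
  (handler-case (parse-integer string :radix radix)
    (parse-error () nil)))

;; (parse-integer-or-nil "42")   ; => 42
;; (parse-integer-or-nil "4x")   ; => NIL
;; (parse-integer-or-nil "")     ; => NIL
```

The empty string is worth noting: (every (function digit-char-p) "")
is true, yet PARSE-INTEGER rejects "", so the EVERY test alone is not a
sufficient guard even in a pure ASCII setting.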


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <ur7g4wstv.fsf@gnu.org>
> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:00:52 +0200]:
>
>     (when (every (function digit-char-p) string)
>       (assert (integerp (parse-integer string :junk-allowed nil))))

(and (every (function digit-char-p) string)
     (parse-integer string))

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.mideasttruth.com/> <http://www.iris.org.il>
<http://www.dhimmi.com/> <http://www.palestinefacts.org/>
My inferiority complex is not as good as yours.
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87mzqswrvz.fsf@thalassa.informatimago.com>
Sam Steingold <···@gnu.org> writes:

>> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:00:52 +0200]:
>>
>>     (when (every (function digit-char-p) string)
>>       (assert (integerp (parse-integer string :junk-allowed nil))))
>
> (and (every (function digit-char-p) string)
>      (parse-integer string))

Yes, when there's only one expression in the body, and that expression
returns only one value, WHEN and AND are equivalent.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

There is no worse tyranny than to force a man to pay for what he does not
want merely because you think it would be good for him. -- Robert Heinlein
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uzmusv41d.fsf@gnu.org>
> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-18 16:06:08 +0200]:
>
> Sam Steingold <···@gnu.org> writes:
>
>>> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:00:52 +0200]:
>>>
>>>     (when (every (function digit-char-p) string)
>>>       (assert (integerp (parse-integer string :junk-allowed nil))))
>>
>> (and (every (function digit-char-p) string)
>>      (parse-integer string))
>
> Yes, when there's only one expression in the body, and that expression
> returns only one value, WHEN and AND are equivalent.

I meant that ASSERT+INTEGERP+JUNK-ALLOWED were not necessary.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.iris.org.il> <http://www.camera.org> <http://ffii.org/>
<http://www.jihadwatch.org/> <http://www.openvotingconsortium.org/>
If I had known that it was harmless, I would have killed it myself.
From: Arthur Lemmens
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <opsqxhx6g6k6vmsw@news.xs4all.nl>
Sam Steingold <···@gnu.org> wrote:

>> Given that reality, the CLISP feature of being inclusive with
>> digit-char-p seems to introduce risk without a clear benefit. It seems
>> more likely that Unicode digits will be generated by accident than by
>> intent.
>
> I don't see the risk.

I do.

I'll bet there are lots of data formats and formal grammars
that have a line:
  digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
or something like that.  Most of those grammars probably didn't
mean #\LAO_DIGIT_FIVE when they specified '5'.

I'm also willing to bet that there are and have been lots of Lisp
programmers who think they can implement such a grammar rule in
their program easily by just calling DIGIT-CHAR-P.

Some of those programs will break when they encounter a
#\LAO_DIGIT_FIVE that was probably just part of a Unicode text.

> 3. The current situation in CLISP allows users to parse Unicode text
> (possibly interpreting numbers &c) by using Unicode attributes
> because DIGIT-CHAR-P returns useful values for Unicode digits.

I think it would be better for CLISP to stick to the principle of least
surprise and provide a separate function (maybe UNICODE-DIGIT-CHAR-P)
for this.
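
[Editorial sketch, not part of the original post.]  A program that
needs the grammar rule above regardless of how the implementation
treats Unicode digits can spell out the standard digits itself.  The
name STANDARD-DIGIT-WEIGHT is hypothetical; the character set is the
one listed in Section 13.1.4.6 (0 to 9, A to Z, a to z).

```lisp
;; Hypothetical strict predicate: return the weight of CH in RADIX only if
;; CH is one of the standard digit characters 0-9, A-Z, a-z; otherwise NIL.
;; It never consults DIGIT-CHAR-P, so implementation extensions cannot
;; affect the result.
(defun standard-digit-weight (ch &optional (radix 10))
  (let ((weight (or (position ch "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
                    (let ((p (position ch "abcdefghijklmnopqrstuvwxyz")))
                      (and p (+ p 10))))))
    (and weight (< weight radix) weight)))

;; (standard-digit-weight #\5)     ; => 5
;; (standard-digit-weight #\a 16)  ; => 10
;; (standard-digit-weight #\8 8)   ; => NIL
```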
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <u4qd1ydl3.fsf@gnu.org>
> * Arthur Lemmens <········@kf4nyy.ay> [2005-05-17 18:40:44 +0200]:
>
> Sam Steingold <···@gnu.org> wrote:
>
>>> Given that reality, the CLISP feature of being inclusive with
>>> digit-char-p seems to introduce risk without a clear benefit. It seems
>>> more likely that Unicode digits will be generated by accident than by
>>> intent.
>>
>> I don't see the risk.
>
> I do.
>
> I'll bet there are lots of data formats and formal grammars
> that have a line:
>   digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
> or something like that.  Most of those grammars probably didn't
> mean #\LAO_DIGIT_FIVE when they specified '5'.
>
> I'm also willing to bet that there are and have been lots of Lisp
> programmers who think they can implement such a grammar rule in
> their program easily by just calling DIGIT-CHAR-P.

they are wrong: CLTS explicitly permits non-standard numeric characters.
(but not non-standard digits for READ).

Thus, if you are parsing text in some formal grammar, you can use the CL
reader which will confine you to the standard ASCII digits, but if you
are writing a parser for NLP, you will be able to figure out the class
of the character correctly.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.jihadwatch.org/> <http://www.palestinefacts.org/>
<http://www.camera.org> <http://www.openvotingconsortium.org/>
Never trust a man who can count to 1024 on his fingers.
From: Kalle Olavi Niemitalo
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87u0l139pq.fsf@Astalo.kon.iki.fi>
Sam Steingold <···@gnu.org> writes:

> they are wrong: CLTS explicitly permits non-standard numeric characters.
> (but not non-standard digits for READ).

The spec appears to be inconsistent on this.

DIGIT-CHAR-P is specified to test "whether char is a digit in the
specified radix".  The glossary definition for "digit" points to
Section 13.1.4.6 (Digits in a Radix), which lists the "potential
digits".  The list includes only standard characters, and there
is no mention of implementation-defined additions.  Thus, it
seems that DIGIT-CHAR-P should return NIL for all non-standard
characters.

ALPHANUMERICP is specified to test whether "character is an
alphabetic[1] character or a numeric character".  Furthermore,
the dictionary entry has a note: "(alphanumericp x) == (or
(alpha-char-p x) (not (null (digit-char-p x))))", and
ALPHA-CHAR-P is specified to test whether "character is an
alphabetic[1] character".  Thus, DIGIT-CHAR-P should return true
for all numeric non-alphabetic characters.  However, Section
13.1.4.4 indicates that implementations can define non-standard
numeric characters.

I see a few ways to resolve this:

(a) Notes in dictionary entries are informational.  Declare that
    the note in ALPHANUMERICP is wrong and there may be numeric
    non-alphabetic characters for which DIGIT-CHAR-P returns
    false.  To detect whether a character is numeric, one should
    use (and (alphanumericp x) (not (alpha-char-p x))) rather
    than (digit-char-p x).  The next revision could define a new
    function for this purpose.

(b) Implementations may define nonstandard numeric characters but
    such characters must be alphabetic as well.

(c) Change DIGIT-CHAR-P so that it can return true for numeric
    characters even if they are not digits in a radix.

(d) Change Section 13.1.4.6 (Digits in a Radix) so that extended
    characters can be digits in a radix.  This would also affect
    the reader; see Section 2.3.1 (Numbers as Tokens).
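
Option (a) can be sketched with standard functions only.  The name
NUMERIC-CHAR-P is hypothetical (it is not in ANSI CL), and under
option (b), where a nonstandard numeric character is also alphabetic,
this test would miss it:

```lisp
;; Sketch of option (a).  NUMERIC-CHAR-P is a hypothetical name,
;; not a function defined by the standard.
(defun numeric-char-p (char)
  "True if CHAR is alphanumeric but not alphabetic."
  (and (alphanumericp char)
       (not (alpha-char-p char))))
```

For the standard characters this agrees with DIGIT-CHAR-P in radix 10;
the two can only differ on implementation-defined characters.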
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uacmswqg3.fsf@gnu.org>
> * Kalle Olavi Niemitalo <···@vxv.sv> [2005-05-17 22:59:13 +0300]:
>
> Sam Steingold <···@gnu.org> writes:
>
>> they are wrong: CLTS explicitly permits non-standard numeric characters.
>> (but not non-standard digits for READ).
>
> The spec appears to be inconsistent on this.

Alas, there are several other places where the spec is inconsistent.
(Considering the huge size of the spec we can only admire the tiny
number of inconsistencies).
In such cases all we have is the "common practice"
moderated by "common sense".
I don't think that the "practice" of writing

  (and (every #'digit-char-p string) (parse-integer string))

is "sensible" (even if it is "common" which I doubt),
given the Unicode standard.


> I see a few ways to resolve this:

I don't think that CL should deviate from Unicode wrt characters.
Reinventing the wheel is always counterproductive.
Unicode offers many attributes, e.g. "digit", "number" &c.
CL accessors (digit-char-p, number-char, decomp-char, bidi-char &c)
should return them.

OTOH, CL reader should remain as it is now[1],
therefore Pascal's invariant should be modified like this:

(loop for i below 256
   for ch = (code-char i)
   unless (or (not (digit-char-p ch))
              (eql (digit-char-p ch) (read-from-string (string ch))))
   collect ch)
==> NIL
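
The same check can be wrapped in a function so that the range becomes a
parameter.  This is only a sketch: INVARIANT-VIOLATIONS is a made-up
name, and the WITH-STANDARD-IO-SYNTAX wrapper is an added assumption so
that a changed *READ-BASE* or readtable does not skew the result.

```lisp
;; Sketch only: INVARIANT-VIOLATIONS is a hypothetical helper name.
(defun invariant-violations (&optional (limit 256))
  "Collect the characters with code below LIMIT that break the invariant."
  (loop for code below (min limit char-code-limit)
        for ch = (code-char code)
        ;; CODE-CHAR may return NIL for codes that have no character.
        when (and ch
                  (digit-char-p ch)
                  (not (eql (digit-char-p ch)
                            (with-standard-io-syntax
                              (read-from-string (string ch))))))
          collect ch))
```

Calling (invariant-violations char-code-limit) then lists whatever
characters an implementation treats as digits without reading them as
integers.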

[1] decomp-char separates the font from the char:
MATHEMATICAL SANS-SERIF BOLD DIGIT FOUR (U+1D7F0) --> <font> 0034
so it is conceivable to backward compatibly extend the CL reader to
accept #\MATHEMATICAL_SANS-SERIF_BOLD_DIGIT_FOUR to mean the same as #\4
___INSIDE___ a number where all the other digits have the same font
(math sans-serif bold).

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.dhimmi.com/> <http://www.mideasttruth.com/> <http://www.memri.org/>
<http://pmw.org.il/> <http://ffii.org/> <http://www.honestreporting.com>
People hear what they want to hear and discard the rest.
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116458857.612795.19830@g14g2000cwa.googlegroups.com>
Sam Steingold wrote:

>
> I don't think that CL should deviate from Unicode wrt characters.
> Reinventing the wheel is always counterproductive.
> Unicode offers many attributes, e.g. "digit", "number" &c.
> CL accessors (digit-char-p, number-char, decomp-char, bidi-char &c)
> should return them.
>
> OTOH, CL reader should remain as it is now[1],
> therefore Pascal's invariant should be modified like this:


Ah, I see now your position more clearly; I had thought you were
claiming that standard Lisp's digit-char-p was inconsistent with the
behavior of the reader in gathering numbers. I withdraw my suggestion
that a "Frankenstein"-number, combining Unicode digits from multiple
scripts, should appear to the Lisp reader as a number. More careful
reading of the standard makes me believe the Lisp reader should read
your "Frankenstein"-number as a symbol, but because I think those
characters should return NIL for Lisp's digit-char-p.

I tripped by assuming the evidence that Unicode puts "digit" in the
name is justification for "digit-char-p" returning an integer. *If*
those characters return non-nil for digit-char-p, I believe consistency
*requires* the reader accept them as equivalent to standard digits, but
I think this is the wrong way to resolve the issue. (I.e., one
standard-compliant way would silently fold all these Unicode characters
to behave as indistinguishable from standard 0..9, but that is
non-desirable.)

I think we can agree

1) the standard definition of digit-char-p only applies to the
"standard" digits.
2) claiming that non-Latin digits should be interpreted as "standard"
digits 0..9 is stretching the standard pretty far.
3) even if it were plausible, using non-Latin digits routinely as
standard digits is bound to get into trouble
4) yet, there should be a way to recognize the wide spectrum of numeric
characters with their weight as digits.
5) it would be nice if the mechanism were standardized.

Clearly, to make you happy, we must change, deviate from, or extend the
ANSI standard one way or another. Formally, changing it or extending it
is a huge process, and I think we generally agree that is unlikely to
happen.

As for deviation, I happen to be conservative in that regard; making
digit-char-p take up the role you wish it to have, and deviating from
the standard in that way, seems to me to be extremely dangerous,
because the standard is then no longer a reliable reference to the
implementation. I feel we ought to just accept that the standard
doesn't match Unicode very well instead of stretching it to fit. I
think controlling the behavior with a new special variable (defaulting
to off?) or keyed to a feature would be perhaps acceptable, but still
dangerous (because the effects are non-local, and therefore hard to
determine if you are in the standard universe or the Unicode universe.)


Similar to my position on the unrelated thread on overloading #'+ to
map over vectors, I think if we want something new we should give it a
new name, and, if we can, just stop using digit-char-p (and
parse-integer) in applications, just like we tend not to use
property lists on symbols: it exists for code that was, unfortunately,
part of the more primitive pre-Unicode world.

That is, I prefer to introduce

unicode-digit-char-p (or unicode-digit-p?)

returning the weight 0..9 (independent of *read-base*?)

And, just as a cleanup, I would add numeric-char-p to fill the lacuna
in the standard that fails to easily identify implementation-defined
non-standard numeric characters. That would create an easy mapping from
standard concepts to predicates.

The Unicode experts would have to volunteer the full list of predicates
to read back every personal detail of each Unicode character. I look
forward to predicates that will recognize roman numerals, vulgar
fractions, etc., converting them to their natural Lisp types. :-)
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87k6lwkq6l.fsf@thalassa.informatimago.com>
···············@hotmail.com" <············@gmail.com> writes:
> Ah, I see now your position more clearly; I had thought you were
> claiming that standard Lisp's digit-char-p was inconsistent with the
> behavior of the reader in gathering numbers. I withdraw my suggestion
> that a "Frankenstein"-number, combining Unicode digits from multiple
> scripts, should appear to the Lisp reader as a number. More careful
> reading of the standard makes me believe the Lisp reader should read
> your "Frankenstein"-number as a symbol, but because I think those
> characters should return NIL for Lisp's digit-char-p.

What about this frankenstein number: 

     (let ((*read-base* 16)) (read-from-string "DeadFace"))

The current standard accepts higher base digits with two forms (upcase
and downcase). 
Why couldn't a new CL accept more forms for Unicode digits?

You're saying that digits in a number should belong to the same
_script_.  I think I could agree with this.  But note that there are
numbers that are perhaps not Frankenstein, just Quasimodo numbers: I'd
want (read-from-string "1９") to return 19, because as I see it, and
probably all users too, 1 and ９ belong to the same script.
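
A rough sketch of such a same-script test follows.  Everything in it is
an assumption made for illustration: the names *DIGIT-GROUPS*,
DIGIT-GROUP and MIXED-DIGIT-GROUPS-P are hypothetical, the grouping of
ASCII and fullwidth digits is a choice rather than a Unicode property,
the table is far from complete, and CHAR-CODE is assumed to return
Unicode code points.

```lisp
;; Hypothetical grouping: a group name followed by the code points of
;; the zero digit of each run of ten digits placed in that group.
(defparameter *digit-groups*
  '((:european #x0030 #xFF10)   ; DIGIT ZERO, FULLWIDTH DIGIT ZERO
    (:arabic-indic #x0660)      ; ARABIC-INDIC DIGIT ZERO
    (:devanagari #x0966)))      ; DEVANAGARI DIGIT ZERO

(defun digit-group (char)
  "Name of the group whose digit run contains CHAR, or NIL."
  (let ((code (char-code char)))
    (loop for (name . zeros) in *digit-groups*
          when (some (lambda (zero) (<= zero code (+ zero 9))) zeros)
            return name)))

(defun mixed-digit-groups-p (string)
  "True if STRING contains digits from more than one group."
  (> (length (remove-duplicates
              (remove nil (map 'list #'digit-group string))))
     1))
```

With this table a string made of DIGIT ONE and FULLWIDTH DIGIT NINE is
not mixed, while ARABIC-INDIC DIGIT ONE followed by DEVANAGARI DIGIT TWO
is.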


> I tripped by assuming the evidence that Unicode puts "digit" in the
> name is justification for "digit-char-p" returning an integer. *If*
> those characters return non-nil for digit-char-p, I believe consistency
> *requires* the reader accept them as equivalent to standard digits, but
> I think this is the wrong way to resolve the issue. (I.e., one
> standard-compliant way would silently fold all these Unicode characters
> to behave as indistinguishable from standard 0..9, but that is
> non-desirable.)

You'd have to justify this "non-desirable" vs. the "desirability" of
#xAaBb being accepted for 43707.  CL specifies a silent folding of
ASCII  (standard characters) to behave indistinguishably...
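
The folding the standard already performs can be checked with standard
functions only; these forms should hold in any conforming
implementation:

```lisp
;; Both cases of a letter carry the same weight as a digit in a radix.
(assert (eql (digit-char-p #\D 16) (digit-char-p #\d 16)))
;; The reader folds case inside a number token.
(assert (= #xAaBb #xAABB 43707))
(assert (= (let ((*read-base* 16)) (read-from-string "DeadFace"))
           #xDEADFACE))
```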


> I think we can agree
>
> 1) the standard definition of digit-char-p only applies to the
> "standard" digits.

Good, then in clisp (digit-char-p (character "９")) should return nil.


> 2) claiming that non-Latin digits should be interpreted as "standard"
> digits 0..9 is stretching the standard pretty far.

ok.


> 3) even if it were plausible, using non-Latin digits routinely as
> standard digits is bound to get into trouble

We should distinguish:

- non-Latin digits used in strange systems (Roman-like, Babylonian,
  Hebrew, etc),

- non-Latin digits used with the normal decimal system (Arabic, etc),

- various forms of Latin digits (FULLWIDTH_DIGIT_*, MATHEMATICAL_*_DIGIT_*),

- the standard DIGIT_*.

I'd say that all but the first category should be readable as numbers
in base 10.

I'd agree to limit other bases to the last two categories (adding Latin
letters), given that most uses of other bases were developed recently
and only with these glyphs, but note that Maya is pure base 20.

As for the other numerical systems, the problem is that these numbers
have a syntax (possibly different for each numerical system), and they
are more _naming_ systems than real numerical systems (see for example
Chinese or Japanese).  So for the same reasons the reader doesn't read
back roman or english numbers printed by ~R, we could avoid reading
these numbers.


> 4) yet, there should be a way to recognize the wide spectrum of numeric
> characters with their weight as digits.

I'd say this is a unicode problem, so we should design a standard
UNICODE package to hold these features.


> 5) it would be nice if the mechanism were standardized.

Indeed.


> Clearly, to make you happy, we must change, deviate from, or extend the
> ANSI standard one way or another. Formally, changing it or extending it
> is a huge process, and I think we generally agree that is unlikely to
> happen.
>
> As for deviation, I happen to be conservative in that regard; making
> digit-char-p take up the role you wish it to have, and deviating from
> the standard in that way, seems to me to be extremely dangerous,
> because the standard is then no longer a reliable reference to the
> implementation. I feel we ought to just accept that the standard
> doesn't match Unicode very well instead of stretching it to fit. I
> think controlling the behavior with a new special variable (defaulting
> to off?) or keyed to a feature would be perhaps acceptable, but still
> dangerous (because the effects are non-local, and therefore hard to
> determine if you are in the standard universe or the Unicode universe.)
>
>
> Similar to my position on the unrelated thread on overloading #'+ to
> map over vectors, I think if we want something new we should give it a
> new name, and, if we can, just stop using digit-char-p (and
> parse-integer) in applications, just like we tend not to use
> property lists on symbols: it exists for code that was, unfortunately,
> part of the more primitive pre-Unicode world.
>
> That is, I prefer to introduce
>
> unicode-digit-char-p (or unicode-digit-p?)
>
> returning the weight 0..9 (independent of *read-base*?)

and 10,20,...,90 and 100,200,...,900, and 0..19, and 0..59, and other
values (1e3, 1e6, ...), according to the digits.  Note also that in
systems like Hebrew, alphabetical letters are digits too. For these
scripts, you'd have both unicode:digit-char-p and unicode:alpha-char-p.
(in the same way you also have (common-lisp:digit-char-p #\Z 36) == 35).


> And, just as a cleanup, I would add numeric-char-p to fill the lacuna
> in the standard that fails to easily identify implementation-defined
> non-standard numeric characters. That would create an easy mapping from
> standard concepts to predicates.
>
> The Unicode experts would have to volunteer the full list of predicates
> to read back every personal detail of each Unicode character. I look
> forward to predicates that will recognize roman numerals, vulgar
> fractions, etc., converting them to their natural Lisp types. :-)

(every (function roman-digit-p) "IVXCLMivxclm")

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116527316.305708.56080@o13g2000cwo.googlegroups.com>
Pascal Bourguignon wrote:
> ···············@hotmail.com" <············@gmail.com> writes:
> > Ah, I see now your position more clearly; I had thought you were
> > claiming that standard Lisp's digit-char-p was inconsistent with the
> > behavior of the reader in gathering numbers. I withdraw my suggestion
> > that a "Frankenstein"-number, combining Unicode digits from multiple
> > scripts, should appear to the Lisp reader as a number. More careful
> > reading of the standard makes me believe the Lisp reader should read
> > your "Frankenstein"-number as a symbol, but because I think those
> > characters should return NIL for Lisp's digit-char-p.
>
> What about this frankenstein number:
>
>      (let ((*read-base* 16)) (read-from-string "DeadFace"))
>
> The current standard accepts higher base digits with two forms (upcase
> and downcase).
> Why couldn't a new CL accept more forms for Unicode digits?
>

The "DeadFace" (assuming they are from the 52 "standard" A..Z,
a..z, characters chosen by the implementation, and not from some higher
code page like "U+FF24 fullwidth latin capital letter D" distinct from
the likely standard "U+0044 Latin Capital Letter D"), clearly falls
within the standard's definition of "digits in a radix", but there can
only be "D" and "d", not many different "D"-like characters.

I.e., I'm assuming that for character #\D from your example

(digit-char-p #\D 16) ==> 13, as it *should* for the "standard" "D"/"d"
but *not* for the possibly implementation-defined "non-standard D"
characters.

My reading of the standard is that the implementation gets to choose
which 96 characters it is using as members of the "standard" set,
(i.e., could be EBCDIC or other coding, or some wild choice from among
all the Unicode spectrum) and anything else is either mapped
indistinguishably to one of those 96, or is an implementation-defined
character which cannot satisfy digit-char-p for *any* radix.

A new CL that accepted more forms as digits would be just
that--"new"--as in "distinct from the current ANSI standard."
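
For code that has to stay within that reading no matter what an
implementation's DIGIT-CHAR-P does, a predicate can be written against
the explicit list of 36 potential digits.  A minimal sketch, where
STANDARD-DIGIT-CHAR-P is a hypothetical name:

```lisp
;; Sketch only: accepts nothing but 0-9, A-Z and a-z, whatever the
;; implementation's own DIGIT-CHAR-P chooses to accept.
(defun standard-digit-char-p (char &optional (radix 10))
  "Weight of CHAR in RADIX if CHAR is a standard digit, otherwise NIL."
  (check-type radix (integer 2 36))
  (let ((weight (or (position char "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
                    (position char "0123456789abcdefghijklmnopqrstuvwxyz"))))
    (and weight (< weight radix) weight)))
```

The two literal strings are searched with the default EQL test, so no
case conversion of implementation-defined characters is involved.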

>
> > I think we can agree
> >
> > 1) the standard definition of digit-char-p only applies to the
> > "standard" digits.
>
> Good, then in clisp (digit-char-p (character "9")) should return nil.

On my browser, I can't tell the difference between that and the usual
ASCII "9" , but if you are talking about something like U+FF19, I would
agree. (Unless clisp decides U+FF19 is the "standard nine" and U+0039
is an implementation-defined numeric character, but this would be quite
ASCII-hostile).

>
>
> > 3) even if it were plausible, using non-Latin digits routinely as
> > standard digits is bound to get into trouble
>
> We should distinguish:
>
> - non-Latin digits used in strange systems (Roman-like, Babylonian,
>   Hebrew, etc),
>
> - non-Latin digits used with the normal decimal system (Arabic, etc),
>
> - various forms of Latin digits (FULLWIDTH_DIGIT_*, MATHEMATICAL_*_DIGIT_*),
>
> - the standard DIGIT_*.
>
> I'd say that all but the first category should be readable as numbers
> in base 10.

Once we agree on predicates for the various Unicode properties,
decisions about "readability" like this would involve *additional*
extensions to the reader & parse-integer also beyond the current
standard. That's a much more complicated discussion than the Lisp
functions which classify characters, because then we have to worry,
just for starters, about Lisp code that might have these characters in
them, and compatibility with old ANSI-conformant implementations that
have no idea what to make of FULLWIDTH_DIGIT_* in what we wish to be an
integer constant, or in reader macros like #2A((1 2) (3 4)). (I guess
that's the main answer to my "Why not [accept Frankenstein numbers in
the reader]?")
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87wtpvhv6a.fsf@thalassa.informatimago.com>
···············@hotmail.com" <············@gmail.com> writes:

> Pascal Bourguignon wrote:
>> ···············@hotmail.com" <············@gmail.com> writes:
>> > Ah, I see now your position more clearly; I had thought you were
>> > claiming that standard Lisp's digit-char-p was inconsistent with the
>> > behavior of the reader in gathering numbers. I withdraw my suggestion
>> > that a "Frankenstein"-number, combining Unicode digits from multiple
>> > scripts, should appear to the Lisp reader as a number. More careful
>> > reading of the standard makes me believe the Lisp reader should read
>> > your "Frankenstein"-number as a symbol, but because I think those
>> > characters should return NIL for Lisp's digit-char-p.
>>
>> What about this frankenstein number:
>>
>>      (let ((*read-base* 16)) (read-from-string "DeadFace"))
>>
>> The current standard accepts higher base digits with two forms (upcase
>> and downcase).
>> Why couldn't a new CL accept more forms for Unicode digits?
>>
>
> The "DeadFace" (assuming they are from the 52 "standard" A..Z,
> a..z, characters chosen by the implementation, and not from some higher
> code page like "U+FF24 fullwidth latin capital letter D" distinct from
> the likely standard "U+0044 Latin Capital Letter D"), clearly falls
> within the standard's definition of "digits in a radix", but there can
> only be "D" and "d", not many different "D"-like characters.

I mean the standard just arbitrarily specifies that D and d will be
the same digit in an adequate base.  It doesn't specify a general rule.

I don't see more difference between #\DIGIT_NINE and
#\FULLWIDTH_DIGIT_NINE than between #\LATIN_CAPITAL_LETTER_D and
#\LATIN_SMALL_LETTER_D.  So I think it would be desirable to have a
more general rule about it (perhaps "implementation" dependent).  I
was hoping that DIGIT-CHAR-P could be used to generalize
this arbitrary specification...


> I.e., I'm assuming that for character #\D from your example
>
> (digit-char-p #\D 16) ==> 13, as it *should* for the "standard" "D"/"d"
> but *not* for the possibly implementation-defined "non-standard D"
> characters.
>
> My reading of the standard is that the implementation gets to choose
> which 96 characters it is using as members of the "standard" set,
> (i.e., could be EBCDIC or other coding, or some wild choice from among
> all the Unicode spectrum) and anything else is either mapped
> indistinguishably to one of those 96, or is an implementation-defined
> character which cannot satisfy digit-char-p for *any* radix.
>
> A new CL that accepted more forms as digits would be just
> that--"new"--as in "distinct from the current ANSI standard."
>
>>
>> > I think we can agree
>> >
>> > 1) the standard definition of digit-char-p only applies to the
>> > "standard" digits.
>>
>> Good, then in clisp (digit-char-p (character "9")) should return nil.
>
> On my browser, I can't tell the difference between that and the usual
> ASCII "9" , but if you are talking about something like U+FF19, 

Yes.

> I would agree. (Unless clisp decides U+FF19 is the "standard nine" and U+0039
> is an implementation-defined numeric character, but this would be quite
> ASCII-hostile).


>> > 3) even if it were plausible, using non-Latin digits routinely as
>> > standard digits is bound to get into trouble
>>
>> We should distinguish:
>>
>> - non-Latin digits used in strange systems (Roman-like, Babylonian,
>>   Hebrew, etc),
>>
>> - non-Latin digits used with the normal decimal system (Arabic, etc),
>>
>> - various forms of Latin digits (FULLWIDTH_DIGIT_*, MATHEMATICAL_*_DIGIT_*),
>>
>> - the standard DIGIT_*.
>>
>> I'd say that all but the first category should be readable as numbers
>> in base 10.
>
> Once we agree on predicates for the various Unicode properties,
> decisions about "readability" like this would involve *additional*
> extensions to the reader & parse-integer also beyond the current
> standard. That's a much more complicated discussion than the Lisp
> functions which classify characters, because then we have to worry,
> just for starters, about Lisp code that might have these characters in
> them, and compatibility with old ANSI-conformant implementations that
> have no idea what to make of FULLWIDTH_DIGIT_* in what we wish to be an
> integer constant, or in reader macros like #2A((1 2) (3 4)). (I guess
> that's the main answer to my "Why not [accept Frankenstein numbers in
> the reader]?")

We already have this problem. If you want to print a portable program
or data readably, you have to do it with only the standard character
set.  Current implementations are allowed extended characters in
symbols.  At this level, I see no difference between a symbol and a
number.  How do you justify the restriction on numbers?  

(defun ⌺ (&rest arguments)
   "Implement the function: APL_FUNCTIONAL_SYMBOL_QUAD_DIAMOND"
   (quad-diamond arguments))

(⌺ #(1 2 3))

;; Or:

(defun RACINE-CARRÉE (x) (sqrt x))


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Kitty like plastic.
Confuses for litter box.
Don't leave tarp around.
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116532805.613973.136880@z14g2000cwz.googlegroups.com>
Pascal Bourguignon wrote:
> ···············@hotmail.com" <············@gmail.com> writes:

> >
> > The "DeadFace" (assuming they are from the 52 "standard" A..Z,
> > a..z, characters chosen by the implementation, and not from some higher
> > code page like "U+FF24 fullwidth latin capital letter D" distinct from
> > the likely standard "U+0044 Latin Capital Letter D"), clearly falls
> > within the standard's definition of "digits in a radix", but there can
> > only be "D" and "d", not many different "D"-like characters.
>
> I mean the standard just arbitrarily specifies that D and d will be
> the same digit in an adequate base.  It doesn't specify a general rule.

The standard exhaustively lists 36 potential "digits in a radix", in
the same way it lists the standard characters, and only refers to case
(which can only be upper and lowercase) as an allowed variation,
allowing 26 variants not explicitly listed. I don't see any way to
accept additional characters as "digits".

> I don't see more difference between #\DIGIT_NINE and
> #\FULLWIDTH_DIGIT_NINE than between #\LATIN_CAPITAL_LETTER_D and
> #\LATIN_SMALL_LETTER_D.  So I think it would be desirable to have a
> more general rule about it (perhaps "implementation" dependent).  I
> was hoping that DIGIT-CHAR-P could be used to generalize
> this arbitrary specification...


Well, I can see the distinction made in 2.1.3 between "LD01/small d"
and "LD02/capital D" as standard characters, but I don't see any other
entry in that section that looks like an alternative for "ND09/digit
9".

...

> set.  Current implementations are allowed extended characters in
> symbols.  At this level, I see no difference between a symbol and a
> number.  How do you justify the restriction on numbers?

Section 2.3.4 explicitly mentions character attributes for symbols.
Section 2.3.1 does not mention character attributes, or
implementation-defined characters, or anything beyond "digits in a
radix" "sign", "slash", "decimal point", or "exponent-marker" for
numbers.
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87hdgzht30.fsf@thalassa.informatimago.com>
···············@hotmail.com" <············@gmail.com> writes:
> Well, I can see the distinction made in 2.1.3 between "LD01/small d"
> and "LD02/capital D" as standard characters, but I don't see any other
> entry in that section that looks like an alternative for "ND09/digit
> 9".
>
> ...
>
>> set.  Current implementations are allowed extended characters in
>> symbols.  At this level, I see no difference between a symbol and a
>> number.  How do you justify the restriction on numbers?
>
> Section 2.3.4 explicitly mentions character attributes for symbols.
> Section 2.3.1 does not mention character attributes, or
> implementation-defined characters, or anything beyond "digits in a
> radix" "sign", "slash", "decimal point", or "exponent-marker" for
> numbers.

Yes, that's the way it's specified.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush
From: ··············@hotmail.com
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <1116440853.976853.285100@g49g2000cwa.googlegroups.com>
Sam Steingold wrote:
> > * Arthur Lemmens <········@kf4nyy.ay> [2005-05-17 18:40:44 +0200]:
> >
> > Sam Steingold <···@gnu.org> wrote:

>
> they are wrong: CLTS explicitly permits non-standard numeric characters.
> (but not non-standard digits for READ).
>
> Thus, if you are parsing text in some formal grammar, you can use the CL
> reader which will confine you to the standard ASCII digits, but if you
> are writing a parser for NLP, you will be able to figure out the class
> of the character correctly.

You should say "standard character digits" not ASCII, but I'll take
that as a typo.

I still don't see how the definition of digit-char-p should return
non-nil for non-standard numeric characters; it refers to "digits in a
radix" (not "numeric digits") which is defined in 13.1.4.6 by
explicitly listing the potential digits in the same way that it denotes
the 96 standard characters, without mentioning implementation-dependent
numeric characters.

You seem to be taking digit-char-p in the sense that the standard would
define something named numeric-char-p. Once a character satisfies
digit-char-p, the character is a digit in the radix, and should satisfy
the reader as described in 2.3.1.

Whether a character is defined by the Unicode standard as a digit or
not has nothing to do with whether the Lisp standard defines it as a
"digit in a radix" (or even "numeric"; although I think it would be
stupid to deny that a "Unicode digit" is "Lisp numeric," it is only
implementation-defined.)
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <ud5rousz2.fsf@gnu.org>
> * ··············@hotmail.com <············@tznvy.pbz> [2005-05-18
> 11:27:33 -0700]: 
>
> Sam Steingold wrote:
>> > * Arthur Lemmens <········@kf4nyy.ay> [2005-05-17 18:40:44 +0200]:
>> >
>> > Sam Steingold <···@gnu.org> wrote:
>
>>
>> they are wrong: CLTS explicitly permits non-standard numeric characters.
>> (but not non-standard digits for READ).
>>
>> Thus, if you are parsing text in some formal grammar, you can use the CL
>> reader which will confine you to the standard ASCII digits, but if you
>> are writing a parser for NLP, you will be able to figure out the class
>> of the character correctly.
>
> You should say "standard character digits" not ASCII, but I'll take
> that as a typo.

yes, thanks.

> You seem to be taking digit-char-p in the sense that the standard
> would define something named numeric-char-p. Once a character
> satisfies digit-char-p, the character is a digit in the radix, and
> should satisfy the reader as described in 2.3.1.

Unicode has separate "digit" and "numeric" attributes,
thus we need both digit-char-p and numeric-char-p.

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.openvotingconsortium.org/> <http://www.honestreporting.com>
<http://pmw.org.il/> <http://www.jihadwatch.org/> <http://ffii.org/>
If Perl is the solution, you're solving the wrong problem. - Erik Naggum
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87k6lxy3p5.fsf@thalassa.informatimago.com>
···············@hotmail.com" <············@gmail.com> writes:
> [...]
> Given that reality, the CLISP feature of being inclusive with
> digit-char-p seems to introduce risk without a clear benefit. It seems
> more likely that Unicode digits will be generated by accident than by
> intent.
>
> However, I have zero experience with this kind of
> internationalization/character-coding issue, so I have no way to gauge
> the true impact.

I stumbled on this problem when I read a Japanese post on Usenet
in which numbers were written with #\FULLWIDTH_DIGIT_ZERO to
#\FULLWIDTH_DIGIT_NINE.  I think Japanese, Chinese and Korean writers
quite often use these wide digits instead of the ASCII ones.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
From: Pascal Bourguignon
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <87br79y27j.fsf@thalassa.informatimago.com>
Sam Steingold <···@gnu.org> writes:

> These are Unicode characters that have the "digit" Unicode attribute.

The "Unicode digit" attribute!


> CLTS:
>
>       digit n. (in a radix) a character that is among the possible
>       digits (0 to 9, A to Z, and a to z) and that is defined to have an
>       associated numeric weight as a digit in that radix. See Section
>       13.1.4.6 (Digits in a Radix).
>
> <http://www.lisp.org/HyperSpec/Body/sec_13-1-4-6.html> appears to be
> fairly specific: only the standard ASCII characters are potential
> digits.  Therefore the Unicode characters with the digit attribute are
> numeric characters but not digits.

And comparing with the specification of DIGIT-CHAR-P:

Description:

    Tests whether char is a digit in the specified radix (i.e., with a
    weight less than radix). If it is a digit in that radix, its
    weight is returned as an integer; otherwise nil is returned. 

DIGIT-CHAR-P doesn't cite numeric characters, only pure digits in radix.

If we take section "13.1.4.6 Digits in a Radix" as an axiom then
DIGIT-CHAR-P should not return true for any other character than these
standard characters.  It sounds like (DIGIT-CHAR-P #\FULLWIDTH_DIGIT_ONE)
should be false, and implementations should rather add a distinct
UNICODE:DIGIT-CHAR-P function where 
(UNICODE:DIGIT-CHAR-P #\FULLWIDTH_DIGIT_ONE) would be 1.
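
A minimal sketch of such a strict test, written so that it does not
depend on the implementation's DIGIT-CHAR-P or on character codes.  The
name RADIX-DIGIT-WEIGHT is hypothetical; it only recognizes the
characters listed in section 13.1.4.6 (0 to 9, A to Z, a to z).

```lisp
;; Hypothetical helper: accepts only the potential digits of
;; section 13.1.4.6, whatever the implementation's DIGIT-CHAR-P does.
(defun radix-digit-weight (char &optional (radix 10))
  "Return the weight of CHAR as a digit in RADIX, or NIL."
  (let ((weight (or (position char "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")
                    (position char "0123456789abcdefghijklmnopqrstuvwxyz"))))
    ;; A character is a digit in RADIX only if its weight is below RADIX.
    (and weight (< weight radix) weight)))
```

With this definition (RADIX-DIGIT-WEIGHT #\a 16) is 10 and
(RADIX-DIGIT-WEIGHT #\a) is NIL; any character outside the 62 listed
ones, wide digits included, gives NIL.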


> Suppose we try to keep Pascal's invariant in all radixes.
>
> #\MATHEMATICAL_BOLD_DIGIT_NINE is 9 in radix 16.
> How about #\MATHEMATICAL_BOLD_CAPITAL_A in radix 16?
> It would seem reasonable to expect it to be 10.

Indeed, if we extended section "13.1.4.6 Digits in a Radix".

> (yes, we can use the "character decomposition" to map
> #\MATHEMATICAL_BOLD_CAPITAL_A to #\A, so, it is possible to arrange that
>    (let ((*read-base* 16))
>      (read-from-string
>       (concatenate 'string
>         (string #\MATHEMATICAL_BOLD_CAPITAL_A)
>         (string #\MATHEMATICAL_BOLD_DIGIT_NINE))))
> returns #xA9).
>
> Now, how about the first letter of the ETHIOPIC alphabet?
> How about alphabets with fewer than 26 letters?
> More than 26 letters?  - why not extend the notion of a radix?
>
> Then, how about #\ETHIOPIC_NUMBER_TEN?
> #\ETHIOPIC_NUMBER_THIRTY?
> #\ETHIOPIC_NUMBER_HUNDRED?
> What should
>      (read-from-string
>       (concatenate 'string
>         (string #\ETHIOPIC_NUMBER_HUNDRED)
>         (string #\ETHIOPIC_NUMBER_TEN)
>         (string #\ETHIOPIC_DIGIT_NINE)))
> return? 119? 100109?  (the Ethiopic system is not positional).

I've always found that limiting radix to 36 was too artificial.
Indeed there are notations with higher radices.


> Or even funnier:
> should
>
>      (read-from-string
>       (concatenate 'string
>         (string #\ARABIC-INDIC_DIGIT_ONE)
>         (string #\DEVANAGARI_DIGIT_TWO)
>         (string #\BENGALI_DIGIT_THREE)
>         (string #\GUJARATI_DIGIT_FOUR)
>         (string #\TAMIL_DIGIT_FIVE)))
>
> return 12345?

> My points are:
>
> 1. Pascal's invariant is not required by the CLTS.

Ok.

> 2. Requiring Pascal's invariant would produce weird results in
>    implementations that use Unicode.

Ok, Unicode (writing systems) is complex.


> 3. The current situation in CLISP allows users to parse Unicode text
>    (possibly interpreting numbers &c) by using Unicode attributes
>    because DIGIT-CHAR-P returns useful values for Unicode digits.

Then I think that DIGIT-CHAR-P should be NIL for any character that is
not a potential digit as defined in section "13.1.4.6 Digits in a
Radix", and that implementations should provide a _distinct_ function
UNICODE:DIGIT-CHAR-P.
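
As a rough illustration of what a separate function could look like,
here is a sketch limited to the fullwidth digits mentioned earlier in
the thread.  It assumes that CHAR-CODE returns the Unicode code point
(this is not guaranteed by the standard), and the names
FULLWIDTH-DIGIT-WEIGHT and DECIMAL-DIGIT-WEIGHT are hypothetical.  A
real UNICODE:DIGIT-CHAR-P would consult the Unicode character database
rather than one hard-coded range.

```lisp
;; ASSUMPTION: CHAR-CODE yields Unicode code points, so that
;; FULLWIDTH DIGIT ZERO..NINE are #xFF10..#xFF19.
(defun fullwidth-digit-weight (char)
  "Return 0..9 if CHAR is a fullwidth digit, else NIL."
  (let ((offset (- (char-code char) #xFF10)))
    (and (<= 0 offset 9) offset)))

;; Decimal weight of either a standard digit or a fullwidth digit.
(defun decimal-digit-weight (char)
  (or (position char "0123456789")
      (fullwidth-digit-weight char)))
```

Keeping the two tests in separate functions leaves the caller to decide
whether wide digits count, instead of changing what DIGIT-CHAR-P means.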


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Until real software engineering is developed, the next best practice
is to develop with a dynamic system that has extreme late binding in
all aspects. The first system to really do this in an important way
is Lisp. -- Alan Kay
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <ufywkws8f.fsf@gnu.org>
> * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:25:36 +0200]:
>
>
>> 3. The current situation in CLISP allows users to parse Unicode text
>>    (possibly interpreting numbers &c) by using Unicode attributes
>>    because DIGIT-CHAR-P returns useful values for Unicode digits.
>
> Then I think that DIGIT-CHAR-P should be NIL for any character that is
> not a potential digit as defined in section "13.1.4.6 Digits in a
> Radix", and that implementations should provide a _distinct_ function
> UNICODE:DIGIT-CHAR-P.

if you want the limited version of DIGIT-CHAR-P, do this:

(defun ASCII-DIGIT-CHAR-P (c)
  (and (< (char-code c) 256)
       (digit-char-p c)))

-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.iris.org.il> <http://www.dhimmi.com/>
<http://www.memri.org/> <http://www.honestreporting.com> <http://pmw.org.il/>
If a train station is a place where a train stops, what's a workstation?
From: Kent M Pitman
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uekc4txs0.fsf@nhplace.com>
Sam Steingold <···@gnu.org> writes:

> > * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:25:36 +0200]:
> >
> >
> >> 3. The current situation in CLISP allows users to parse Unicode text
> >>    (possibly interpreting numbers &c) by using Unicode attributes
> >>    because DIGIT-CHAR-P returns useful values for Unicode digits.
> >
> > Then I think that DIGIT-CHAR-P should be NIL for any character that is
> > not a potential digit as defined in section "13.1.4.6 Digits in a
> > Radix", and that implementations should provide a _distinct_ function
> > UNICODE:DIGIT-CHAR-P.
> 
> if you want the limited version of DIGIT-CHAR-P, do this:
> 
> (defun ASCII-DIGIT-CHAR-P (c)
>   (and (< (char-code c) 256)
>        (digit-char-p c)))

Strictly, this is not portable code.

[Also, if ASCII was 7 bits, why so arbitrary a constant as 256?]

Probably indeed with most extant Common Lisp implementations, there's
either ASCII, some sort of ISO-Latin, or Unicode and this will work
because the digits are probably located identically.

But, strictly, there is no requirement at all that the ASCII character
codes be given char-codes corresponding to their ASCII value.

For example, the Symbolics Lisp machine, having evolved from the "SAIL
character set" (developed at the Stanford AI Lab, dunno if they still
call it that--MIT gave up its traditional lab names for new, stupider
ones), uses the low character codes for Bullets, Arrows,
circle-crosses, and whatnot.  The above probably holds true, but only
by accident, because I think the digits are in the ASCII place.  I'm
quite sure that if you implemented
 (defun ascii-graphic-char-p (c)
   (and (< (char-code c) 256)
        (graphic-char-p c)))
you'd get some wrong values.  e.g., 0 is not the code of a character 
that is a graphic in ASCII, but is the code of a character that is a 
graphic in the SAIL or LispM character set.

In general, though, my point is that you can't be certain how the standard
characters are coded.

At minimum, you should write some #+/#- conditionals and/or you should
have code after your defun that just tests that all ten digits
actually return a true value (their weight) from this ad hoc
implementation, and signals an error on load in an implementation where
that condition fails.
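
A sketch of the load-time check described above.  The bound is changed
from 256 to 128 here, following the remark that ASCII is 7 bits, and
CHECK-ASCII-DIGIT-CHAR-P is a hypothetical name.  The code still assumes
ASCII-like char-codes; the point of the check is to fail loudly where
that assumption does not hold.

```lisp
(defun ascii-digit-char-p (c)
  ;; Non-portable: assumes the digits have char-codes below 128.
  (and (< (char-code c) 128)
       (digit-char-p c)))

(defun check-ascii-digit-char-p ()
  "Signal an error unless all ten digits get their expected weights."
  (loop for ch across "0123456789"
        for weight from 0
        unless (eql (ascii-digit-char-p ch) weight)
          do (error "ASCII-DIGIT-CHAR-P returns ~S for ~S; expected ~D."
                    (ascii-digit-char-p ch) ch weight))
  t)

;; Run the check when the file is loaded.
(check-ascii-digit-char-p)
```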
From: Sam Steingold
Subject: Re: Invariant with DIGIT-CHAR-P and the reader.
Date: 
Message-ID: <uu0l0v3t4.fsf@gnu.org>
> * Kent M Pitman <······@aucynpr.pbz> [2005-05-18 14:24:22 +0000]:
>
> Sam Steingold <···@gnu.org> writes:
>
>> > * Pascal Bourguignon <···@vasbezngvzntb.pbz> [2005-05-17 23:25:36 +0200]:
>> >
>> >
>> >> 3. The current situation in CLISP allows users to parse Unicode text
>> >>    (possibly interpreting numbers &c) by using Unicode attributes
>> >>    because DIGIT-CHAR-P returns useful values for Unicode digits.
>> >
>> > Then I think that DIGIT-CHAR-P should be NIL for any character that is
>> > not a potential digit as defined in section "13.1.4.6 Digits in a
>> > Radix", and that implementations should provide a _distinct_ function
>> > UNICODE:DIGIT-CHAR-P.
>> 
>> if you want the limited version of DIGIT-CHAR-P, do this:
>> 
>> (defun ASCII-DIGIT-CHAR-P (c)
>>   (and (< (char-code c) 256)
>>        (digit-char-p c)))
>
> Strictly, this is not portable code.

sorry, I should have written

(defun standard-digit-char-p (c)
  (and (standard-char-p c)
       (digit-char-p c)))

it was more tongue in cheek than a real suggestion of portable,
usable code.


-- 
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.mideasttruth.com/> <http://www.openvotingconsortium.org/>
<http://pmw.org.il/> <http://www.dhimmi.com/> <http://www.camera.org>
Those who can laugh at themselves will never cease to be amused.