From: ···@pobox.com
Subject: whitespace-char-p
Date: 
Message-ID: <1174672973.932101.303880@n76g2000hsh.googlegroups.com>
In Common Lisp how do I test if a character is a whitespace character
(in the current readtable)?

I have a solution which I am confident works but I'm not confident
that it's best, recommended, or idiomatic.

drj

From: dpapathanasiou
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174681107.981604.68160@y80g2000hsf.googlegroups.com>
On Mar 23, 1:02 pm, ····@pobox.com wrote:
> In Common Lisp how do I test if a character is a whitespace character
> (in the current readtable)?

CL-PPCRE (Portable Perl-compatible regular expressions for Common
Lisp), the gold standard of text parsing libraries, defines whitespace
this way:

  (unless (boundp '+whitespace-char-string+)
    (defconstant +whitespace-char-string+
      (coerce
       '(#\Space #\Tab #\Linefeed #\Return #\Page)
       'string)
      "A string of all characters which are considered to be
whitespace.
Same as Perl's [\\s]."))

And uses those values to do whitespace remove/replace/split/etc.

The code snippet above came from the "util.lisp" file, but you can
download the whole CL-PPCRE source from http://weitz.de/cl-ppcre/
From: ···@pobox.com
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174688268.168006.43370@b75g2000hsg.googlegroups.com>
On Mar 23, 8:18 pm, "dpapathanasiou" <···················@gmail.com>
wrote:
> On Mar 23, 1:02 pm, ····@pobox.com wrote:
>
> > In Common Lisp how do I test if a character is a whitespace character
> > (in the current readtable)?
>
> CL-PPCRE (Portable Perl-compatible regular expressions for Common
> Lisp), the gold standard of text parsing libraries, defines whitespace
> this way:

Hmm, I should probably make myself a bit clearer.  I want a test (or
definition if you like) that is sensitive to the current readtable and
the changes that I might have made using set-syntax-from-char.  For
example if I go:

(set-syntax-from-char #\_ #\ )

Then I want (whitespace-char-p #\_) to be true.  I want to inspect the
current readtable.
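
For illustration, here is a minimal sketch of the effect being described. The function name is made up for this example, and it works on a fresh copy of the standard readtable so the global one stays untouched. It assumes the default *READ-BASE* of 10.

```lisp
(defun read-with-underscore-as-whitespace (string)
  "Hypothetical helper: read STRING with #\\_ given whitespace syntax."
  ;; (copy-readtable nil) returns a fresh copy of the standard readtable.
  (let ((*readtable* (copy-readtable nil)))
    ;; Copy the syntax of #\Space onto #\_ in the current (copied) readtable.
    (set-syntax-from-char #\_ #\Space)
    (read-from-string string)))

;; The underscores are skipped like spaces, so the token read is just 42.
(read-with-underscore-as-whitespace "_42_")  ; => 42
```

A readtable-sensitive whitespace-char-p would have to return true for #\_ while that binding of *READTABLE* is in effect.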

As it happens, that CL-PPCRE code uses four semi-standard characters
(#\Tab, #\Linefeed, #\Return and #\Page), which makes it not maximally
portable.

I want something a bit cleverer than
(typep x '(member #\Space #\Newline)).

drj
From: Pascal Bourguignon
Subject: Re: whitespace-char-p
Date: 
Message-ID: <874po8xrv7.fsf@voyager.informatimago.com>
···@pobox.com writes:

> On Mar 23, 8:18 pm, "dpapathanasiou" <···················@gmail.com>
> wrote:
>> On Mar 23, 1:02 pm, ····@pobox.com wrote:
>>
>> > In Common Lisp how do I test if a character is a whitespace character
>> > (in the current readtable)?
>>
>> CL-PPCRE (Portable Perl-compatible regular expressions for Common
>> Lisp), the gold standard of text parsing libraries, defines whitespace
>> this way:
>
> Hmm, I should probably make myself a bit clearer.  I want a test (or
> definition if you like) that is sensitive to the current readtable and
> the changes that I might have made using set-syntax-from-char.  For
> example if I go:
>
> (set-syntax-from-char #\_ #\ )
>
> Then I want (whitespace-char-p #\_) to be true.  I want to inspect the
> current readtable.
>
> As it happens that PCRE code uses semi-standard characters (4 of them)
> which makes it not maximally portable.
>
> I want something a bit cleverer than
> (typep x '(member #\Space #\Newline)).


This is not possible at all, portably.

As you've noticed, the only standard, public, function dealing with
character syntaxes is set-syntax-from-char.



If you want portable code, you can re-implement the lisp reader,
portably.  Or use mine (very new):
http://darcs.informatimago.com/lisp/common-lisp/reader.lisp

But this may not help you very much if you want to do that at load
time or compilation time: you'd have to re-implement LOAD or
COMPILE-FILE to use this reader instead of the "closed",
implementation-specific one.



Of course, given that we have the sources of most implementations, you
could also just add the needed API to all of them.  That would exclude
you from portability to commercial implementations, but with half a
dozen free CL implementations publishing the same API, it would be a
de facto standard, and perhaps the commercial implementers would
notice.  Anyway, the main difficulty is that you'd have to present a
uniform interface over different implementations, which may not be
trivial.


-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: ···@pobox.com
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174928129.013062.292830@e65g2000hsc.googlegroups.com>
On Mar 26, 10:43 am, Pascal Bourguignon <····@informatimago.com>
wrote:
> ····@pobox.com writes:
> > On Mar 23, 8:18 pm, "dpapathanasiou" <···················@gmail.com>
> > wrote:
> >> On Mar 23, 1:02 pm, ····@pobox.com wrote:
>
> >> > In Common Lisp how do I test if a character is a whitespace character
> >> > (in the current readtable)?
>
> >> CL-PPCRE (Portable Perl-compatible regular expressions for Common
> >> Lisp), the gold standard of text parsing libraries, defines whitespace
> >> this way:
>
> > Hmm, I should probably make myself a bit clearer.  I want a test (or
> > definition if you like) that is sensitive to the current readtable and
> > the changes that I might have made using set-syntax-from-char.  For
> > example if I go:
>
> > (set-syntax-from-char #\_ #\ )
>
> > Then I want (whitespace-char-p #\_) to be true.  I want to inspect the
> > current readtable.
>
> > As it happens that PCRE code uses semi-standard characters (4 of them)
> > which makes it not maximally portable.
>
> > I want something a bit cleverer than
> > (typep x '(member #\Space #\Newline)).
>
> This is not possible at all, portably.

Are you claiming my solution does not work, or is not portable?  If
so, I'd like more details.  If you're merely claiming that it's not
possible to inspect the readtable directly, well then I agree, sadly.

drj
From: Pascal Bourguignon
Subject: Re: whitespace-char-p
Date: 
Message-ID: <87lkhjx5wf.fsf@voyager.informatimago.com>
···@pobox.com writes:

> On Mar 26, 10:43 am, Pascal Bourguignon <····@informatimago.com>
> wrote:
>> ····@pobox.com writes:
>> > On Mar 23, 8:18 pm, "dpapathanasiou" <···················@gmail.com>
>> > wrote:
>> >> On Mar 23, 1:02 pm, ····@pobox.com wrote:
>>
>> >> > In Common Lisp how do I test if a character is a whitespace character
>> >> > (in the current readtable)?
>>
>> >> CL-PPCRE (Portable Perl-compatible regular expressions for Common
>> >> Lisp), the gold standard of text parsing libraries, defines whitespace
>> >> this way:
>>
>> > Hmm, I should probably make myself a bit clearer.  I want a test (or
>> > definition if you like) that is sensitive to the current readtable and
>> > the changes that I might have made using set-syntax-from-char.  For
>> > example if I go:
>>
>> > (set-syntax-from-char #\_ #\ )
>>
>> > Then I want (whitespace-char-p #\_) to be true.  I want to inspect the
>> > current readtable.
>>
>> > As it happens that PCRE code uses semi-standard characters (4 of them)
>> > which makes it not maximally portable.
>>
>> > I want something a bit cleverer than
>> > (typep x '(member #\Space #\Newline)).
>>
>> This is not possible at all, portably.
>
> Are you claiming my solution does not work, or is not portable?  If
> so, I'd like more details.  If you're merely claiming that it's not
> possible to inspect the readtable directly, well then I agree, sadly.

Well, I hadn't considered this kind of solution (and hadn't read the
whole thread before posting my answer).  I meant that it is not
possible, portably, to directly access the implementation's reader
data structures.


In this specific case, wanting to know whether a character has this
specific syntax (whitespace, perhaps it would work for some other
character syntaxes too), it looks like your solution can work portably
and safely enough, if slowly.

-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: ···@pobox.com
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174733138.672741.301270@l75g2000hse.googlegroups.com>
On Mar 23, 6:02 pm, ····@pobox.com wrote:
> In Common Lisp how do I test if a character is a whitespace character
> (in the current readtable)?
>
> I have a solution which I am confident works but I'm not confident
> that it's best, recommended, or idiomatic.

The problem is that it's not possible (unless I've missed something)
to directly inspect the readtable.  My solution essentially observes
the behaviour of READ and deduces facts about the readtable from
that.  In essence I create a one-character string and pass that to
READ-FROM-STRING.  You either get NIL (for whitespace), some other
object (the integer 1 for example), or an error.  You have to test for
macro characters first (otherwise you end up thinking that #\; is
whitespace).

Here's the code:

(defun whitespace-char-p (x)
  (and (not (get-macro-character x))
       (values (ignore-errors
                 (not (read-from-string (coerce (list x) 'string)
                                        nil))))))

This seems incredibly long-winded.

It also can intern symbols, (whitespace-char-p #\x) will intern the
symbol X, and that's probably not desirable.
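
To illustrate why the macro-character test has to come first, here is a small sketch that classifies what READ-FROM-STRING does with a one-character string. The helper name and the keywords it returns are invented for this example, and it deliberately uses the standard readtable (via WITH-STANDARD-IO-SYNTAX) so the results are predictable.

```lisp
(defun read-outcome (char)
  "Hypothetical helper: classify what READ-FROM-STRING does with the
one-character string made from CHAR, under the standard readtable."
  (with-standard-io-syntax
    (let ((eof (list :eof)))            ; fresh object used as the EOF marker
      (handler-case
          (if (eq eof (read-from-string (string char) nil eof))
              :nothing-read
              :object-read)
        (error () :error)))))

(read-outcome #\Space)  ; => :NOTHING-READ (whitespace)
(read-outcome #\1)      ; => :OBJECT-READ  (the integer 1)
(read-outcome #\;)      ; => :NOTHING-READ (a comment, not whitespace)
```

The semicolon gives the same outcome as a space, which is exactly the false positive that the GET-MACRO-CHARACTER test filters out.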

Any improvements?

drj
From: Tim Bradshaw
Subject: Re: whitespace-char-p
Date: 
Message-ID: <eu35ge$69t$1$830fa795@news.demon.co.uk>
On 2007-03-24 10:45:38 +0000, ···@pobox.com said:

> This seems incredibly long-winded.
> 
> It also can intern symbols, (whitespace-char-p #\x) will intern the
> symbol X, and that's probably not desirable.

That's the least of your worries.  Unless you know what's in the 
readtable, READ can do *anything at all* that your CL implementation 
can do.  No code that cares about security at all should ever call READ 
unless using a readtable which it controls completely: that is one 
constructed from a standard readtable by adding suitable macros.  In 
this case, of course, you know what is in the readtable and none of 
this is needed.
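
As a rough sketch of what reading with a controlled readtable can look like (the function name is made up here, and this only covers the readtable and #. parts of the problem):

```lisp
(defun read-from-string-safely (string)
  "Hypothetical helper: read one object from STRING using the standard
readtable, with read-time evaluation disabled."
  (with-standard-io-syntax
    ;; WITH-STANDARD-IO-SYNTAX binds *READ-EVAL* to T, so turn it off
    ;; again; #. then signals a READER-ERROR instead of evaluating code.
    (let ((*read-eval* nil))
      (read-from-string string nil nil))))

(read-from-string-safely "(1 2 3)")  ; => (1 2 3)
```

Note that this still interns any symbols it reads (in COMMON-LISP-USER, which is what WITH-STANDARD-IO-SYNTAX binds *PACKAGE* to), so it is not a complete answer for untrusted input.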

--tim
From: ···@pobox.com
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174741830.229733.289210@y80g2000hsf.googlegroups.com>
On Mar 24, 12:27 pm, Tim Bradshaw <····@tfeb.org> wrote:
> On 2007-03-24 10:45:38 +0000, ····@pobox.com said:
>
> > This seems incredibly long-winded.
>
> > It also can intern symbols, (whitespace-char-p #\x) will intern the
> > symbol X, and that's probably not desirable.
>
> That's the least of your worries.  Unless you know what's in the
> readtable, READ can do *anything at all* that your CL implementation
> can do.  No code that cares about security at all should ever call READ
> unless using a readtable which it controls completely: that is one
> constructed from a standard readtable by adding suitable macros.  In
> this case, of course, you know what is in the readtable and none of
> this is needed.

Actually I don't know what's in the readtable; I want a
whitespace-char-p that returns true for #\x if I've made #\x a
whitespace character in the current readtable (or perhaps more
importantly, if someone else has).

I take your point about READ, but in this case I'm using
READ-FROM-STRING with a string of length 1.  A bit of thinking should
enable us to bound what can happen in this case.  Note that we know
that the character comprising the string is not a macro character; the
other cases are that it could be an illegal character or one of the
escapes, both of which raise an error, or it could be a constituent, in
which case we get a symbol or number returned, or it could be
whitespace, in which case NIL is returned.  In no case is arbitrary
code executed.  Except, and this _is_ a bug, I don't actually check
that the argument x to whitespace-char-p is a character.  However, I do
pass x to both get-macro-character and coerce, and either of those
would choke with an error if x was not a character before we got to
READ-FROM-STRING.  Probably.

An addition of (check-type x character) would improve matters.

drj
From: Tim Bradshaw
Subject: Re: whitespace-char-p
Date: 
Message-ID: <eu3rrq$7cb$1$8300dec7@news.demon.co.uk>
On 2007-03-24 13:10:30 +0000, ···@pobox.com said:

> I do pass x to both get-macro-character

Sorry, hadn't noticed you did that.  Interning symbols is still (as you 
pointed out) pretty bad.
From: Vassil Nikolov
Subject: Re: whitespace-char-p
Date: 
Message-ID: <m3d52wdc6n.fsf@localhost.localdomain>
On Sat, 24 Mar 2007 18:49:30 +0000, Tim Bradshaw <···@tfeb.org> said:
| ...
| Interning symbols is still (as you pointed out) pretty bad.

  Well, we could bind *PACKAGE* to a scratch package, and then
  unintern the resulting symbol if any, but that is rather
  inelegant (and slow)...
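
  For what it's worth, here is one way the scratch-package idea might
  be sketched. The function name and the package-name prefix are made
  up for this example. It deletes the whole scratch package afterwards
  rather than uninterning the symbol, and it assumes the generated
  package name does not collide with an existing package.

```lisp
(defun scratch-whitespace-char-p (char &optional (readtable *readtable*))
  "Hypothetical variant of WHITESPACE-CHAR-P that reads in a throwaway package."
  (check-type char character)
  (let ((scratch (make-package (symbol-name (gensym "WHITESPACE-SCRATCH-"))
                               :use '())))
    (unwind-protect
         (let ((*readtable* readtable)
               (*package* scratch)      ; any symbol read is interned here
               (*read-suppress* nil))   ; otherwise READ returns NIL for tokens
           (and (not (get-macro-character char readtable))
                (values
                 (ignore-errors
                  (not (read-from-string (string char) nil nil))))))
      ;; Deleting the package discards whatever symbol was interned.
      (delete-package scratch))))

;; Usage with a modified copy of the standard readtable:
(let ((rt (copy-readtable nil)))
  (set-syntax-from-char #\_ #\Space rt)
  (scratch-whitespace-char-p #\_ rt))   ; => T
```

  Creating and deleting a package on every call is certainly slow; a
  single long-lived scratch package would be cheaper, at the cost of
  keeping the interned symbols around.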

  ---Vassil.


-- 
The truly good code is the obviously correct code.
From: Pillsy
Subject: Re: whitespace-char-p
Date: 
Message-ID: <1174799497.537363.230030@y80g2000hsf.googlegroups.com>
On Mar 24, 6:45 am, ····@pobox.com wrote:
[...]
> It also can intern symbols, (whitespace-char-p #\x) will intern the
> symbol X, and that's probably not desirable.

> Any improvements?

You can work around the interning problem using the following trick,
which is, I think, portable. At least, it works on the implementations
I've tried. =)

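(defun uninterned-read-from-string (string)
  ;; Call the #: reader macro function directly; it takes the stream,
  ;; the sub-character and the numeric argument.
  (funcall (get-dispatch-macro-character #\# #\:)
           (make-string-input-stream string)
           #\: nil))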

Cheers,
Pillsy
From: John Thingstad
Subject: Re: whitespace-char-p
Date: 
Message-ID: <op.tpoy9zdxpqzri1@pandora.upc.no>
On Fri, 23 Mar 2007 19:02:54 +0100, <···@pobox.com> wrote:

> In Common Lisp how do I test if a character is a whitespace character
> (in the current readtable)?
>
> I have a solution which I am confident works but I'm not confident
> that it's best, recommended, or idiomatic.
>
> drj
>

(defpackage :my-extensions
  (:use :common-lisp)
  ;; LispWorks defines this function in the :lispworks package.
  (:shadow #:whitespace-char-p))

(in-package :my-extensions)

;; DEFCONSTANT with a list value can signal an error when the file is
;; reloaded, so DEFPARAMETER is used.  #\Page is the form feed character.
(defparameter *whitespace-characters*
  '(#\Space #\Tab #\Linefeed #\Return #\Page))

(defun whitespace-char-p (char)
  (not (null (member char *whitespace-characters* :test #'char=))))

As you see, it doesn't inspect the readtable, and it misses U+3000
(Ideographic Space) if your implementation uses Unicode.  You might
want to add that.  Beyond that, why do you need to inspect the
readtable?
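
If a fixed set is acceptable, the U+3000 suggestion could be sketched roughly as follows. The names are invented for this example, and it assumes that character codes follow Unicode code points on implementations whose CHAR-CODE-LIMIT is large enough, which the standard does not guarantee.

```lisp
;; Hypothetical names.  Assumes CHAR-CODE values are Unicode code points
;; wherever CHAR-CODE-LIMIT exceeds #x3000.
(defparameter *fixed-whitespace-characters*
  (append (list #\Space #\Tab #\Linefeed #\Return #\Page)
          (when (< #x3000 char-code-limit)
            (let ((ideographic-space (code-char #x3000)))
              ;; CODE-CHAR may return NIL if no such character exists.
              (when ideographic-space
                (list ideographic-space))))))

(defun fixed-whitespace-char-p (char)
  "True if CHAR is in the fixed list; the readtable is not consulted."
  (and (member char *fixed-whitespace-characters* :test #'char=) t))
```

Like the version above, this ignores any syntax changes made with set-syntax-from-char.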

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/