From: Matthew X. Economou
Subject: Determining whitespace
Date: 
Message-ID: <w4obs5zhebn.fsf@eco-fs1.irtnog.org>
Is there an implementation-independent way to determine if a character
is considered whitespace?  I'm looking for the equivalent of the
isspace() function in the standard C library, but the permuted symbol
index in the CLHS only lists predicates for alphabetic, digit, and
graphics characters.  There also doesn't seem to be an implementation-
independent way to query the reader for this information.

Am I missing anything obvious?  Will I just have to roll my own
predicate?

-- 
Matthew X. Economou <········@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)
Max told his friend that he'd just as soon not go hiking in the hills.
Said he, "I'm an anti-climb Max."  [So is that punchline.]

From: Erik Naggum
Subject: Re: Determining whitespace
Date: 
Message-ID: <3243432163983480@naggum.no>
* Matthew X. Economou
| Is there an implementation-independent way to determine if a character is
| considered whitespace?

  What are you going to do with the result?

-- 
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
From: Matthew X. Economou
Subject: Re: Determining whitespace
Date: 
Message-ID: <w4o3crbh32a.fsf@eco-fs1.irtnog.org>
>>>>> "Erik" == Erik Naggum <····@naggum.no> writes:

    Erik> What are you going to do with the result?

I'm writing a library function that parses an IP address embedded in a
string.  I'm using PARSE-INTEGER as a model for the function's
behavior.  In addition to being able to operate on sub-strings and
ignoring junk, PARSE-INTEGER ignores leading and trailing whitespace,
and I'd like to do the same, using the same definition of whitespace
as the hosting Lisp implementation if possible.
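
For reference, the PARSE-INTEGER behavior described above looks roughly
like this.  This is only an illustration of the standard function; the
sample strings are made up for the example.

```lisp
;; Leading and trailing whitespace is skipped.
(parse-integer "  42  ")                          ; primary value 42

;; :START and :END select the substring to parse.
(parse-integer "port 8080 open" :start 5 :end 9)  ; primary value 8080

;; With :JUNK-ALLOWED T, parsing stops at the first character that
;; cannot be part of the integer instead of signaling an error.  The
;; second value is the index at which parsing stopped.
(parse-integer "192.168" :junk-allowed t)         ; values 192 and 3
```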

Is this a reasonable thing to do?

-- 
Matthew X. Economou <········@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)
"If it's not on fire, it's a software problem." --Carrie Fish
From: Erik Naggum
Subject: Re: Determining whitespace
Date: 
Message-ID: <3243450731199834@naggum.no>
* Matthew X. Economou
| I'm writing a library function that parses an IP address embedded in a
| string.

  Since an IP address may be several different things, I think the function
  should be separated into two parts: one that searches for an IP address
  (however defined: IPv4, IPv6, abbreviated or full), and several functions
  that accept whatever passes for IP addresses and return the appropriate
  address structure.  I have found that I need CIDR coding with both /n and
  /mask, but in other cases, /port is used.  Sometimes, even .port is used
  (which does not work with abbreviated IP addresses), although I consider
  the smartest choice to be :port with IPv4 and /port with IPv6.  When you
  make this separation of functionality, there should be no need to know
  what the whitespace characters are.  Actually processing everything that
  people do with IP addresses is fascinatingly complex.  Many losers have
  no concern for parsability of the output from their programs.  *sigh*

  Surprisingly often, wanting to know if you look at a whitespace character
  means that you have chosen a less-than-ideal approach to the solution.
  If you parse using a stream, `peek-char' has a skip-whitespace option.

-- 
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
From: Matthew X. Economou
Subject: Re: Determining whitespace
Date: 
Message-ID: <w4optueff6a.fsf@eco-fs1.irtnog.org>
>>>>> "Erik" == Erik Naggum <····@naggum.no> writes:

    Erik> Since an IP address may be several different things, I think
    Erik> the function should be separated into two parts: one that
    Erik> searches for an IP address (however defined: IPv4, IPv6,
    Erik> abbreviated or full), and several functions that accept
    Erik> whatever passes for IP addresses and return the appropriate
    Erik> address structure.

I think I'm on the right track.  The code I'm writing now (the
PARSE-ADDRESS function) handles only IPv4 dotted-quads.  I thought it
would be a lower-level function suitable for use in a reader macro (or
other user-input routine), just as PARSE-INTEGER seems to be used by
READ.
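
A minimal sketch of what such a dotted-quad parser could look like when
built on PARSE-INTEGER.  This is not the PARSE-ADDRESS code mentioned
above; the name PARSE-DOTTED-QUAD, the 32-bit integer result, and the
error messages are assumptions made for the example.

```lisp
(defun parse-dotted-quad (string &key (start 0) (end (length string)))
  "Parse \"a.b.c.d\" in STRING between START and END.
Return the address as a 32-bit integer and the index where parsing
stopped.  Signal an error on malformed input."
  (let ((pos start)
        (address 0))
    (dotimes (i 4 (values address pos))
      (multiple-value-bind (octet next)
          ;; PARSE-INTEGER is lenient: it skips whitespace and accepts a
          ;; leading sign.  A stricter parser would check digits itself.
          (parse-integer string :start pos :end end :junk-allowed t)
        (unless (and octet (<= 0 octet 255))
          (error "Malformed octet at index ~D in ~S" pos string))
        (setf address (+ (* address 256) octet))
        ;; Every octet but the last must be followed by a dot.
        (when (< i 3)
          (unless (and (< next end) (char= (char string next) #\.))
            (error "Expected a dot at index ~D in ~S" next string))
          (incf next))
        (setf pos next)))))

(parse-dotted-quad "192.168.0.1")                   ; values 3232235521 and 11
(parse-dotted-quad "host 10.0.0.1 up" :start 5 :end 13)
```

Tracking POS by hand like this is the price of working directly on the
string rather than on a stream.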

    Erik> Actually processing everything that people do with IP
    Erik> addresses is fascinatingly complex.

I didn't realize how complicated it could be until I took a look at
the source code to the IP address parsing routines in several
different operating systems and resolver libraries.

    Erik> Surprisingly often, wanting to know if you look at a
    Erik> whitespace character means that you have chosen a
    Erik> less-than-ideal approach to the solution.  If you parse
    Erik> using a stream, `peek-char' has a skip-whitespace option.

I was processing the input string character by character, instead of
converting it to a string-stream.

-- 
Matthew X. Economou <········@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)
Max told his friend that he'd just as soon not go hiking in the hills.
Said he, "I'm an anti-climb Max."  [So is that punchline.]
From: Erik Naggum
Subject: Re: Determining whitespace
Date: 
Message-ID: <3243540501105910@naggum.no>
* Matthew X. Economou
| I thought it would be a lower-level function suitable for use in a reader
| macro (or other user-input routine), just as PARSE-INTEGER seems to be
| used by READ.

  `parse-integer' is not used by `read'.

| I didn't realize how complicated it could be until I took a look at the
| source code to the IP address parsing routines in several different
| operating systems and resolver libraries.

  No kidding.  People do so many horrible things you could cry.

| I was processing the input string character by character, instead of
| converting it to a string-stream.

  Ideally, a string-stream should be better from all perspectives, but is
  often much more expensive than need be.

-- 
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
From: Tim Bradshaw
Subject: Re: Determining whitespace
Date: 
Message-ID: <ey3wuol9a9f.fsf@cley.com>
* Matthew X Economou wrote:

String streams are COOL and should be used for almost everything.
Real Lisp Programmers use string streams instead of lists. If your
implementor makes string streams expensive, complain vigorously.

(half serious)

--tim
From: Johannes Grødem
Subject: Re: Determining whitespace
Date: 
Message-ID: <lzu1jrqqw1.fsf@unity.copyleft.no>
* "Matthew X. Economou" <···············@irtnog.org>:

> Am I missing anything obvious?  Will I just have to roll my own
> predicate?

I've tried to find this as well, but with no luck.  I use the
following to mean white-space, though:

(#\Tab #\Newline #\Linefeed #\Page #\Return #\Space)

I guess there might be cases where you want some of these not to count
as whitespace.

(I got these from the table in section 2.1.4 of the HyperSpec.  Those
are the characters listed as having whitespace-syntax type.)

-- 
Johannes Grødem <OpenPGP: 5055654C>
From: Johannes Grødem
Subject: Re: Determining whitespace
Date: 
Message-ID: <lzptufqqil.fsf@unity.copyleft.no>
* "Matthew X. Economou" <···············@irtnog.org>:

> Am I missing anything obvious?  Will I just have to roll my own
> predicate?

I've tried to find this as well, but with no luck.  I use the
following to mean white-space, though:

(#\Tab #\Newline #\Linefeed #\Page #\Return #\Space)

I guess there might be cases where you want some of these not to count
as whitespace.

(I got these from the table in section 2.1.4 of the HyperSpec.  These
are the characters listed as having whitespace-syntax type.)
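
Wrapped up as a predicate, that list might be used as follows.  This is
just a sketch: the names *WHITESPACE-CHARS* and WHITESPACE-CHAR-P are
made up for the example and are not part of the standard.

```lisp
;; Apart from #\Newline and #\Space, these character names are
;; semi-standard, so an implementation is not strictly required to
;; support all of them.
(defparameter *whitespace-chars*
  '(#\Tab #\Newline #\Linefeed #\Page #\Return #\Space)
  "Characters treated as whitespace by WHITESPACE-CHAR-P.")

(defun whitespace-char-p (char &optional (bag *whitespace-chars*))
  "Return T if CHAR is one of the characters in BAG, else NIL."
  (and (member char bag :test #'char=) t))

(whitespace-char-p #\Space)                      ; => T
(whitespace-char-p #\a)                          ; => NIL

;; The same list also works as the character bag for STRING-TRIM.
(string-trim *whitespace-chars* "  10.0.0.1  ")  ; => "10.0.0.1"
```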

-- 
Johannes Grødem <OpenPGP: 5055654C>
From: Matthew X. Economou
Subject: Re: Determining whitespace
Date: 
Message-ID: <w4oy993fgp1.fsf@eco-fs1.irtnog.org>
>>>>> "Johannes" == Johannes Grødem <······@ifi.uio.no> writes:

    Johannes> (I got these from the table in section 2.1.4 of the
    Johannes> HyperSpec.  These are the characters listed as having
    Johannes> whitespace-syntax type.)

This gave me an idea.  Since I'm consciously trying to mimic the
behavior of PARSE-INTEGER, especially its ability to parse substrings
via the START and END arguments, I have to manually track my position
within the string.  It would be a lot easier to treat the string as a
string stream via WITH-INPUT-FROM-STRING, as I get both substrings and
bounds checking for free with streams.

The other nice thing this gives me is PEEK-CHAR, which with a peek
type of T, peeks ahead to the first non-whitespace character in the
stream.
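
A small sketch of the idea, assuming the standard readtable is in
effect.  FIRST-TOKEN-CHAR is a made-up name used only for illustration.

```lisp
(defun first-token-char (string &key (start 0) end)
  "Return the first non-whitespace character of STRING between START
and END, or NIL if that part of the string holds only whitespace."
  (with-input-from-string (in string :start start :end end)
    ;; A peek type of T skips over whitespace before peeking.
    ;; EOF-ERROR-P is NIL, so hitting the end returns NIL rather than
    ;; signaling an error.
    (peek-char t in nil nil)))

(first-token-char "   10.0.0.1")              ; => #\1
(first-token-char "ab   cd" :start 2)         ; => #\c
(first-token-char "ab   cd" :start 2 :end 4)  ; => NIL
```

The stream takes care of the substring bounds, so no index has to be
carried around by hand.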

I definitely need this behavior at the start of parsing, and I think I
can make it work to end parsing.

Thanks for the help!  I'll be sure to post the code when I'm done.

-- 
Matthew X. Economou <········@irtnog.org> - Unsafe at any clock speed!
I'm proud of my Northern Tibetian heritage! (http://www.subgenius.com)
Max told his friend that he'd just as soon not go hiking in the hills.
Said he, "I'm an anti-climb Max."  [So is that punchline.]
From: Pekka P. Pirinen
Subject: Re: Determining whitespace
Date: 
Message-ID: <uptuddjnb.fsf@globalgraphics.com>
"Matthew X. Economou" <···············@irtnog.org> writes:
> Is there an implementation-independent way to determine if a character
> is considered whitespace?  I'm looking for the equivalent of the
> isspace() function in the standard C library,

Considered by whom?  The thing is, it depends.  In the standard,
there's the whitespace syntax type, and then there's whitespace(1),
which is independent of the readtable (and there's no standard way to
determine if a character is either of those).  But if you're parsing
some non-CL syntax, that's the wrong place to look; you should look at
the definition of that syntax.  And then you should spare a moment to
think about possible extension to character sets other than ASCII: Do
you want, e.g., U+00A0 No-Break Space or U+3000 Ideographic Space?
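
One way to keep that decision with the syntax being parsed is to pass
the set of whitespace characters in explicitly.  A rough sketch follows;
MAKE-WHITESPACE-PREDICATE and both variables are hypothetical names, and
it assumes character codes coincide with Unicode code points, which the
standard does not guarantee.

```lisp
(defun make-whitespace-predicate (chars)
  "Return a predicate that is true of exactly the characters in CHARS."
  (lambda (char)
    (and (find char chars :test #'char=) t)))

;; Whitespace as some hypothetical address syntax might define it:
;; space and tab only.
(defparameter *narrow-whitespace-p*
  (make-whitespace-predicate (list #\Space #\Tab)))

;; The same set, widened with U+00A0 and U+3000 where the
;; implementation has characters with those codes.
(defparameter *wide-whitespace-p*
  (make-whitespace-predicate
   (append (list #\Space #\Tab)
           (loop for code in '(#xA0 #x3000)
                 for char = (and (< code char-code-limit)
                                 (code-char code))
                 when char collect char))))

(funcall *narrow-whitespace-p* #\Space)    ; => T
(funcall *narrow-whitespace-p* #\Newline)  ; => NIL
```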
-- 
Pekka P. Pirinen
In cyberspace, everybody can hear you scream.  - Gary Lewandowski