Avoiding an http headache

From: Larry Hunter
Subject: Avoiding an http headache
Date: Thu, 03 Apr 2003 16:37:33 +0000
Message-ID: <m3r88jo7ua.fsf@huge.uchsc.edu>

Folks,

I was bitten by this a few weeks ago, and then one of my lab members
spent hours tracking down another instance of this rather painful, if
fortunately rare, problem.

Web pages often include the character #\no-break_space (char code
160). This is not one of the characters in the standard character set
defined by the spec (section 2.1.3), but it is increasingly widely
used. In many editors and development environments (e.g. emacs), it
renders as a space, yet the spec's definition of whitespace characters
in standard syntax (the figure in section 2.1.4, and section 2.1.4.7)
does not include it.

Trying to parse such a web page, or trying to use code copied from a
web page with such a character in it, can lead to hard-to-debug errors.

To see the problem, try this:

  (eval (read-from-string (format nil "(quote~afoo)" #\no-break_space)))

In an implementation that treats the character as a constituent, the
reader sees a one-element list holding a single symbol whose name
contains the no-break space, so EVAL tries to call an undefined
function instead of returning FOO.

I recommend putting

 (set-syntax-from-char #\no-break_space #\space)

in your initialization files to avoid any problems associated with
this character.
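
If changing the global readtable feels too invasive, the same idea can
be scoped to a private copy. The following is only a sketch: the names
*nbsp*, make-nbsp-tolerant-readtable and read-from-string-tolerantly
are made up for illustration, and it assumes that (code-char 160) is
the no-break space, as it is in implementations whose character codes
follow Latin-1 or Unicode. Using code-char also sidesteps the fact that
the character's name is implementation-dependent.

```lisp
;; Assumption: character code 160 is NO-BREAK SPACE in this implementation.
(defvar *nbsp* (code-char 160))

(defun make-nbsp-tolerant-readtable (&optional (base *readtable*))
  "Return a copy of BASE in which NO-BREAK SPACE has the syntax of #\\Space."
  (let ((rt (copy-readtable base)))
    ;; Arguments are: to-char, from-char, to-readtable, from-readtable.
    ;; A from-readtable of NIL means the standard readtable.
    (set-syntax-from-char *nbsp* #\Space rt nil)
    rt))

(defun read-from-string-tolerantly (string)
  "Read one form from STRING, treating NO-BREAK SPACE as whitespace."
  (let ((*readtable* (make-nbsp-tolerant-readtable)))
    (read-from-string string)))

;; Usage: this reads the two-element list (QUOTE FOO).
(read-from-string-tolerantly (format nil "(quote~afoo)" *nbsp*))
```

The global readtable is left alone, so other code that relies on the
default treatment of the character is unaffected.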

-- 

Lawrence Hunter, Ph.D.
Director, Center for Computational Pharmacology
Associate Professor of Pharmacology, PMB & Computer Science

phone  +1 303 315 1094           UCHSC, Campus Box C236    
fax    +1 303 315 1098           School of Medicine rm 2817b   
cell   +1 303 324 0355           4200 E. 9th Ave.                 
email: ············@uchsc.edu    Denver, CO 80262       
PGP key on public keyservers     http://compbio.uchsc.edu/hunter

From: Kent M Pitman
Subject: Re: Avoiding an http headache
Date: Thu, 03 Apr 2003 17:32:52 +0000
Message-ID: <sfwadf7bi63.fsf@shell01.TheWorld.com>

Larry Hunter <············@uchsc.edu> writes:

> I was bitten by this a few weeks ago, and then one of my lab members
> spent hours tracking down another instance of this rather painful, if
> fortunately rare, problem.
> 
> Web pages often include the character #\no-break_space (char code
> 160). This is not one of the characters in the standard character set
> defined by the spec (section 2.1.3), but it is increasingly widely
> used. In many editors and development environments (e.g. emacs), it
> renders as a space, yet the spec's definition of whitespace characters
> in standard syntax (the figure in section 2.1.4, and section 2.1.4.7)
> does not include it.
> 
> Trying to parse such a web page, or trying to use code copied from a
> web page with such a character in it, can lead to hard-to-debug errors.
> 
> To see the problem, try this:
> 
>   (eval (read-from-string (format nil "(quote~afoo)" #\no-break_space)))
> 
> I recommend putting
> 
>  (set-syntax-from-char #\no-break_space #\space)
> 
> in your initialization files to avoid any problems associated with
> this character.

This is an interesting problem because, while I don't disagree with the
above analysis, there is also a conflicting analysis that I've had to
make for myself in other contexts.

If you write a web browser, or TeX, or you implement CSS in some
context, you add characters in boxes iteratively until the right (or,
in the case of right-to-left, maybe I should say "target") margin.  If
there is an overrun, you take the whole box of non-whitespace chars
you are accumulating and move it to the next line.  If there is
whitespace, you iteratively accumulate it into a single piece.  (If
TeX is involved, you have to fuss with penalty structure as you go,
too.)  And, moreover, multiple &nbsp;'s don't contract like regular
whitespace does.  But the point is that &nbsp; clearly counts as an
alphabetic in this context--it happens to have no pixels, but
conceptually it is an alphabetic.  Its whole nature would break if it
were declared whitespace-like in this context.

Probably you're right that for Lispy purposes, the right thing is as you
say and maybe for layout purposes, it needs a separate predicate.  But
I just wanted to add a bit of defense for why the matter is more complicated
than it seems at first glance.
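
A minimal sketch of what "a separate predicate" might look like, with
all names (reader-whitespace-p, line-break-opportunity-p,
split-into-boxes) invented for illustration. It assumes (code-char 160)
is the no-break space and uses a deliberately tiny notion of layout
whitespace; a real layout engine would need far more than this.

```lisp
;; Assumption: character code 160 is NO-BREAK SPACE in this implementation.
(defvar *nbsp* (code-char 160))

(defun reader-whitespace-p (char)
  "True if CHAR should separate tokens for Lispy purposes.
Hypothetical policy: the usual whitespace characters plus NO-BREAK SPACE."
  (or (char= char *nbsp*)
      (member char '(#\Space #\Tab #\Newline #\Return #\Linefeed #\Page)
              :test #'char=)))

(defun line-break-opportunity-p (char)
  "True if a layout engine may break a line at CHAR.
NO-BREAK SPACE is excluded: for layout it behaves like an alphabetic."
  (and (not (char= char *nbsp*))
       (member char '(#\Space #\Tab #\Newline) :test #'char=)
       t))

(defun split-into-boxes (text)
  "Split TEXT into unbreakable boxes at line-break opportunities.
Runs of breakable whitespace collapse; NO-BREAK SPACE stays inside its box."
  (let ((boxes '())
        (start nil))
    (dotimes (i (length text))
      (if (line-break-opportunity-p (char text i))
          ;; A breakable character closes the box in progress, if any.
          (when start
            (push (subseq text start i) boxes)
            (setf start nil))
          ;; Any other character opens a box if none is open.
          (unless start
            (setf start i))))
    (when start
      (push (subseq text start) boxes))
    (nreverse boxes)))

;; Usage: "a", NO-BREAK SPACE, "b" stay together in one box; the two
;; ordinary spaces collapse into a single break before "c".
(split-into-boxes (format nil "a~ab  c" *nbsp*))
```

The same character answers true to one predicate and false to the
other, which is the conflict described above.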