noob: BOM and UFT-8 in Clisp

From: Pekka Niiranen
Subject: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 14:00:19 +0000
Message-ID: <45aa3714$0$24628$39db0f71@news.song.fi>

Hi,

I have two files with similar text in them:

	First line
	Sec line
	3rd line

First file is saved as "UTF16-LE" with BOM and I read one
line from it like this:

(with-open-file (stream "/home/pekka/practicalcl/test_utf16.txt"
:external-format "utf-16")
	   (format t "~a~%" (read-line stream)))
First line
NIL

All fine. But when I save the file in "UTF-8" with BOM I get this:

CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt"
:external-format "utf-8")
	   (format t "~a~%" (read-line stream)))
   First line # Note here the extra blank. Pekka.
NIL

Another try fails too:

CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt")
	   (format t "~a~%" (read-line stream)))
CFirst line  First line # Note here the extra "C". Pekka.
NIL

How can I read "UTF-8" file so that BOM is not processed IF it exists?


Fraction of my .emacs:
----------------- .emacs -------------------------------

;;;; SLIME Setup

(add-to-list 'load-path "C:/home/pekka/slime/")
(set-language-environment "UTF-8")
(setq slime-net-coding-system 'utf-8-unix)
(require 'slime)
(slime-setup :autodoc t)
(setq common-lisp-hyperspec-root "file://C:/home/pekka/HyperSpec/")
(setq inferior-lisp-program "C:/bin/clisp-2.41/clisp.exe -I -q -K full -E utf-8")

------------------------------------------------

-pekka-

From: Pascal Bourguignon
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 15:12:02 +0000
Message-ID: <87mz4l3anx.fsf@thalassa.informatimago.com>

Pekka Niiranen <··············@wlanmail.com> writes:
> All fine. But when I save the file in "UTF-8" with BOM I get this:
>
> CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt"
> :external-format "utf-8")
> 	   (format t "~a~%" (read-line stream)))
>   First line # Note here the extra blank. Pekka.
> NIL


C/USER[13]> (with-open-file (stream "/tmp/a.txt" :external-format "utf-8")
             (read-line stream))
"-*- coding: iso-8859-1 -*-" ;
NIL
C/USER[14]> (values (aref * 0) (aref * 1))
#\ZERO_WIDTH_NO-BREAK_SPACE ;
#\-

I don't see any extra space...  When the BOM is not processed as a
BOM, it should mean ZERO_WIDTH_NO-BREAK_SPACE and that's what it's
taken to mean here.
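
For what it's worth, the three bytes EF BB BF that a "UTF-8 with BOM" file presumably starts with are just the ordinary UTF-8 encoding of U+FEFF, so a plain UTF-8 decoder hands back that character. A minimal sketch of the arithmetic (the function name is made up for illustration, it is not part of CLISP):

```lisp
;; Hypothetical helper: decode one three-byte UTF-8 sequence
;; (1110xxxx 10xxxxxx 10xxxxxx) into its code point.
(defun decode-utf8-3-bytes (b1 b2 b3)
  (logior (ash (logand b1 #x0F) 12)   ; low 4 bits of the lead byte
          (ash (logand b2 #x3F) 6)    ; low 6 bits of the first continuation byte
          (logand b3 #x3F)))          ; low 6 bits of the second continuation byte

;; The UTF-8 BOM bytes decode to U+FEFF.
(format t "~X~%" (decode-utf8-3-bytes #xEF #xBB #xBF)) ; prints FEFF
```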

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Specifications are for the weak and timid!"

From: Alex Mizrahi
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 14:31:48 +0000
Message-ID: <45aa3ed7$0$49208$14726298@news.sunsite.dk>

(message (Hello 'Pekka)
(you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
(

 PN> All fine. But when I save the file in "UTF-8" with BOM I get this:

i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"People who lust for the Feel of keys on their fingertips (c) Inity")

From: Pekka Niiranen
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 14:37:30 +0000
Message-ID: <45AA402A.8040006@wlanmail.com>

Alex Mizrahi wrote:
> (message (Hello 'Pekka)
> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
> (
> 
>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
> 
> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?

It is, but BOM is useful for the editors. How can one distinguish
between latin1 and UTF8 otherwise?
> 
> )
> (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
> "People who lust for the Feel of keys on their fingertips (c) Inity") 
> 
>

From: Ralf Mattes
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 14:58:15 +0000
Message-ID: <pan.2007.01.14.14.58.15.155175@mh-freiburg.de>

On Sun, 14 Jan 2007 16:37:30 +0200, Pekka Niiranen wrote:

> Alex Mizrahi wrote:
>> (message (Hello 'Pekka)
>> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>> (
>> 
>>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>> 
>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
> 
> It is, but BOM is useful for the editors. How can one distinguish
> between latin1 and UTF8 otherwise?

The same way it distinguishes between UTF-8 and UTF-16 in the presence of
BOMs ;-)

Seriously: there's no way to abuse a Byte Order Mark to "detect" file
encodings. Use it for what it's designed to be used: to detect endianness
in files.

As for Clisp complaining: what exactly _is_ that extra space? One
(invisible) character? What bytes are in the file?
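
One way to answer the "what bytes" question is to open the file as a binary stream, so no external format gets a chance to interpret anything. A rough sketch, with a made-up helper name and only standard Common Lisp calls:

```lisp
;; Hypothetical helper: return the first COUNT bytes of the file at PATH
;; as a list of integers, bypassing any character decoding.
(defun first-bytes (path &optional (count 3))
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let* ((buffer (make-array count :element-type '(unsigned-byte 8)))
           (end (read-sequence buffer in)))   ; END < COUNT for short files
      (coerce (subseq buffer 0 end) 'list))))

;; A file saved as "UTF-8 with BOM" would be expected to give
;; (239 187 191), i.e. #xEF #xBB #xBF:
;; (first-bytes "/home/pekka/practicalcl/test_utf8.txt")
```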

 Cheers, ralfD


>> )
>> (With-best-regards '(Alex Mizrahi) :aka 'killer_storm) "People who lust
>> for the Feel of keys on their fingertips (c) Inity") 
>> 
>>

From: Pascal Bourguignon
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 15:00:41 +0000
Message-ID: <87r6tx3b6u.fsf@thalassa.informatimago.com>

Pekka Niiranen <··············@wlanmail.com> writes:

> Alex Mizrahi wrote:
>> (message (Hello 'Pekka)
>> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>> (
>>
>>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>
>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>
> It is, but BOM is useful for the editors. How can one distinguish
> between latin1 and UTF8 otherwise?

How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
file starting with: "ï»¿" ?
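
To make the ambiguity concrete: the BOM bytes EF BB BF are also three perfectly valid ISO-8859-1 characters (ï, », ¿). A small sketch, assuming an implementation whose character codes are Unicode code points (true of CLISP built with Unicode support); the helper name is invented for the example:

```lisp
;; Hypothetical helper: interpret a list of bytes as ISO-8859-1 text.
;; Each byte maps to the character with the same code, because the first
;; 256 Unicode code points coincide with ISO-8859-1.
(defun bytes-as-latin-1 (bytes)
  (map 'string #'code-char bytes))

(let ((text (bytes-as-latin-1 '(#xEF #xBB #xBF))))
  (format t "~D characters: ~{U+~4,'0X~^ ~}~%"
          (length text) (map 'list #'char-code text)))
;; prints: 3 characters: U+00EF U+00BB U+00BF
```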

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"For me, the great question has never been 'Who am I? Where am I going?',
as our friend Pascal so deftly put it, but rather 'How am I going to get
out of this?'" -- Jean Yanne

From: Ralf Mattes
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 15:11:23 +0000
Message-ID: <pan.2007.01.14.15.11.23.129588@mh-freiburg.de>

On Sun, 14 Jan 2007 16:00:41 +0100, Pascal Bourguignon wrote:

> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
> file starting with: "ï»¿" ?

I guess the answer would be: "oh, that hardly ever happens .." ;-)
Just read the unicode BOM FAQ again - and had a good laugh. 

  "Under some higher level protocols, use of a BOM may be mandatory (or
  prohibited) in the Unicode data stream".

It works --- or not. 
So, can we conclude that the BOM is part of the data stream and delivered
to the application? 

 Cheers, RalfD

From: Pekka Niiranen
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 15:16:18 +0000
Message-ID: <45aa48e3$0$24607$39db0f71@news.song.fi>

Pascal Bourguignon wrote:
> Pekka Niiranen <··············@wlanmail.com> writes:
> 
>> Alex Mizrahi wrote:
>>> (message (Hello 'Pekka)
>>> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>> (
>>>
>>>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>
>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>> It is, but BOM is useful for the editors. How can one distinguish
>> between latin1 and UTF8 otherwise?
> 
> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
> file starting with: "ï»¿" ?
> 
Fine, but since I told Clisp explicitly the file is in UTF8 I expected
it to skip BOM as it did with UTF16LE.

-pekka-

From: Sam Steingold
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 18:48:21 +0000
Message-ID: <m3zm8lv40a.fsf@loiso.podval.org>

> * Pekka Niiranen <··············@jynaznvy.pbz> [2007-01-14 17:16:18 +0200]:
>
> Pascal Bourguignon wrote:
>> Pekka Niiranen <··············@wlanmail.com> writes:
>>
>>> Alex Mizrahi wrote:
>>>> (message (Hello 'Pekka)
>>>> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>>> (
>>>>
>>>>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>>
>>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>>> It is, but BOM is useful for the editors. How can one distinguish
>>> between latin1 and UTF8 otherwise?
>>
>> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
>> file starting with: "ï»¿" ?
>>
> Fine, but since I told Clisp explicitly the file is in UTF8 I expected
> it to skip BOM as it did with UTF16LE.

you did not.
you told CLISP that the file is UTF8.
CLISP does not support BOM.

-- 
Sam Steingold (http://sds.podval.org/) on Fedora Core release 6 (Zod)
http://thereligionofpeace.com http://dhimmi.com http://memri.org
http://camera.org http://mideasttruth.com http://pmw.org.il http://ffii.org
Professionalism is being dispassionate about your work.

From: Ralf Mattes
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Sun, 14 Jan 2007 15:54:49 +0000
Message-ID: <pan.2007.01.14.15.54.48.702597@mh-freiburg.de>

On Sun, 14 Jan 2007 17:16:18 +0200, Pekka Niiranen wrote:

> Pascal Bourguignon wrote:
>> Pekka Niiranen <··············@wlanmail.com> writes:
>> 
>>> Alex Mizrahi wrote:
>>>> (message (Hello 'Pekka)
>>>> (you :wrote  :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>>> (
>>>>
>>>>  PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>>
>>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>>> It is, but BOM is useful for the editors. How can one distinguish
>>> between latin1 and UTF8 otherwise?
>> 
>> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
>> file starting with: "ï»¿" ?
>> 
> Fine, but since I told Clisp explicitly the file is in UTF8 I expected
> it to skip BOM as it did with UTF16LE.
>

From the unicode book (chapter 13.6, p. 13):

  Where the character set information is explicitly marked, such as
  in UTF-16BE or UTF-16LE, then all U+FEFF characters, even at the very
  beginning of the text, are to be interpreted as zero width no-break
  spaces. Similarly, where Unicode text has known byte order, initial
  U+FEFF characters are also not required and are to be interpreted as
  zero width no-break spaces. 

HTH RalfD

> -pekka-

From: Thomas Bakketun
Subject: Re: noob: BOM and UFT-8 in Clisp
Date: Mon, 15 Jan 2007 20:09:31 +0000
Message-ID: <5125brF1ihjorU1@mid.individual.net>

Pekka Niiranen wrote:
 
> How can I read "UTF-8" file so that BOM is not processed IF it exists?

Just remove it yourself before reading the first line:

(when (char= (peek-char nil stream) (code-char #xFEFF))
  (read-char stream))
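
Wrapped up as a function, the same idea might look like the sketch below. SKIP-BOM is a made-up name, and this variant passes NIL for EOF-ERROR-P so that an empty file does not signal END-OF-FILE. It assumes the stream has already decoded the bytes into characters, so the BOM shows up as the single character U+FEFF.

```lisp
;; Hypothetical helper: consume a leading U+FEFF from STREAM if present.
;; Returns STREAM so the call can be nested inside READ-LINE.
(defun skip-bom (stream)
  (when (eql (peek-char nil stream nil nil) (code-char #xFEFF))
    (read-char stream))
  stream)

;; Usage, following the original post (the string "utf-8" is the
;; external format used there with CLISP):
;; (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt"
;;                         :external-format "utf-8")
;;   (format t "~a~%" (read-line (skip-bom stream))))
```

Only the first character is checked, so a U+FEFF later in the file is left alone and still reads as a zero width no-break space.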