Hi,
I have two files with similar text in them:
First line
Sec line
3rd line
The first file is saved as UTF-16LE with a BOM, and I read one
line from it like this:
(with-open-file (stream "/home/pekka/practicalcl/test_utf16.txt"
                        :external-format "utf-16")
  (format t "~a~%" (read-line stream)))
First line
NIL
All fine. But when I save the file in "UTF-8" with BOM I get this:
CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt"
                                 :external-format "utf-8")
           (format t "~a~%" (read-line stream)))
First line # Note here the extra blank. Pekka.
NIL
Another try, this time without :external-format, fails too:
CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt")
           (format t "~a~%" (read-line stream)))
CFirst line First line # Note here the extra "C". Pekka.
NIL
How can I read a UTF-8 file so that the BOM is skipped if it exists?
Here is the relevant part of my .emacs:
----------------- .emacs -------------------------------
;;;; SLIME Setup
(add-to-list 'load-path "C:/home/pekka/slime/")
(set-language-environment "UTF-8")
(setq slime-net-coding-system 'utf-8-unix)
(require 'slime)
(slime-setup :autodoc t)
(setq common-lisp-hyperspec-root "file://C:/home/pekka/HyperSpec/")
(setq inferior-lisp-program "C:/bin/clisp-2.41/clisp.exe -I -q -K full -E utf-8")
------------------------------------------------
-pekka-
Pekka Niiranen <··············@wlanmail.com> writes:
> All fine. But when I save the file in "UTF-8" with BOM I get this:
>
> CL-USER> (with-open-file (stream "/home/pekka/practicalcl/test_utf8.txt"
> :external-format "utf-8")
> (format t "~a~%" (read-line stream)))
> First line # Note here the extra blank. Pekka.
> NIL
C/USER[13]> (with-open-file (stream "/tmp/a.txt" :external-format "utf-8")
(read-line stream))
"-*- coding: iso-8859-1 -*-" ;
NIL
C/USER[14]> (values (aref * 0) (aref * 1))
#\ZERO_WIDTH_NO-BREAK_SPACE ;
#\-
I don't see any extra space... When the BOM is not processed as a
BOM, it should mean ZERO_WIDTH_NO-BREAK_SPACE and that's what it's
taken to mean here.
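If the line has already been read, the invisible character can also be dropped from the string afterwards. The following is only a sketch: the name strip-leading-bom is made up for illustration, and it assumes an implementation whose character codes are Unicode code points, so that (code-char #xFEFF) returns a character.

```lisp
(defun strip-leading-bom (line)
  "Return LINE without a single leading U+FEFF, or LINE itself if there is none."
  (let ((bom (code-char #xFEFF)))   ; NIL on implementations without Unicode
    (if (and bom
             (plusp (length line))
             (char= (char line 0) bom))
        (subseq line 1)
        line)))

;; Hypothetical usage on a string that starts with U+FEFF:
(strip-leading-bom (format nil "~CFirst line" (code-char #xFEFF)))
```

Only one leading U+FEFF is removed, so a zero width no-break space that is genuinely part of the text further on is left alone.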
--
__Pascal Bourguignon__ http://www.informatimago.com/
"Specifications are for the weak and timid!"
(message (Hello 'Pekka)
(you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
(
PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"People who lust for the Feel of keys on their fingertips (c) Inity")
Alex Mizrahi wrote:
> (message (Hello 'Pekka)
> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
> (
>
> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>
> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
It is, but a BOM is useful for editors. How else can one distinguish
between Latin-1 and UTF-8?
>
> )
> (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
> "People who lust for the Feel of keys on their fingertips (c) Inity")
>
>
On Sun, 14 Jan 2007 16:37:30 +0200, Pekka Niiranen wrote:
> Alex Mizrahi wrote:
>> (message (Hello 'Pekka)
>> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>> (
>>
>> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>
>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>
> It is, but BOM is useful for the editors. How can one distinguish
> between latin1 and UTF8 otherwise?
The same way it distinguishes between UTF-8 and UTF-16 in the presence of
BOMs ;-)
Seriously: there's no way to abuse a Byte Order Mark to "detect" file
encodings. Use it for what it was designed for: detecting endianness
in files.
As for Clisp complaining: what exactly _is_ that extra space? One
(invisible) character? What bytes are in the file?
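One way to answer the question about the bytes is to open the file as a binary stream and look at the first few octets. This is a rough sketch, not anything from the posts above: first-bytes and utf-8-bom-p are made-up names, and the path in the usage line is the one from the original question.

```lisp
(defun first-bytes (path &optional (count 4))
  "Return up to COUNT leading octets of the file at PATH as a list of integers."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for i from 0 below count
          for byte = (read-byte in nil nil)   ; NIL at end of file
          while byte
          collect byte)))

(defun utf-8-bom-p (bytes)
  "True if the list BYTES starts with EF BB BF, the UTF-8 encoding of U+FEFF."
  (and (>= (length bytes) 3)
       (equal (subseq bytes 0 3) '(#xEF #xBB #xBF))))

;; Hypothetical usage:
;; (utf-8-bom-p (first-bytes "/home/pekka/practicalcl/test_utf8.txt"))
```

A UTF-16LE file with a BOM would instead start with the octets FF FE.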
Cheers, ralfD
>> )
>> (With-best-regards '(Alex Mizrahi) :aka 'killer_storm) "People who lust
>> for the Feel of keys on their fingertips (c) Inity")
>>
>>
Pekka Niiranen <··············@wlanmail.com> writes:
> Alex Mizrahi wrote:
>> (message (Hello 'Pekka)
>> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>> (
>>
>> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>
>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>
> It is, but BOM is useful for the editors. How can one distinguish
> between latin1 and UTF8 otherwise?
How do you distinguish between a UTF-8 with BOM file and an ISO-8859-1
file starting with: "ï»¿" ?
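To make the ambiguity concrete: the three octets of a UTF-8 BOM, EF BB BF, are also three perfectly ordinary ISO-8859-1 characters. A small sketch, assuming an implementation whose character codes are Unicode code points (decode-latin-1 is a made-up helper name):

```lisp
(defun decode-latin-1 (bytes)
  "Decode the octets in BYTES as ISO-8859-1.
Relies on Latin-1 codes coinciding with the first 256 Unicode code points."
  (map 'string #'code-char bytes))

;; The UTF-8 BOM octets read as Latin-1 give three printable characters:
;; U+00EF (i with diaeresis), U+00BB (right guillemet), U+00BF (inverted
;; question mark).
(decode-latin-1 '(#xEF #xBB #xBF))
```

So the octets alone cannot tell the two encodings apart; something outside the file has to say which one is meant.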
--
__Pascal Bourguignon__ http://www.informatimago.com/
For me, the big question has never been: "Who am I? Where am I going?"
as our friend Pascal so deftly put it, but rather:
"How am I going to get out of this?" -- Jean Yanne
On Sun, 14 Jan 2007 16:00:41 +0100, Pascal Bourguignon wrote:
> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
> file starting with: "ï»¿" ?
I guess the answer would be: "oh, that hardly ever happens .." ;-)
Just read the Unicode BOM FAQ again - and had a good laugh.
"Under some higher level protocols, use of a BOM may be mandatory (or
prohibited) in the Unicode data stream".
It works --- or not.
So, can we conclude that the BOM is part of the data stream and delivered
to the application?
Cheers, RalfD
Pascal Bourguignon wrote:
> Pekka Niiranen <··············@wlanmail.com> writes:
>
>> Alex Mizrahi wrote:
>>> (message (Hello 'Pekka)
>>> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>> (
>>>
>>> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>
>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>> It is, but BOM is useful for the editors. How can one distinguish
>> between latin1 and UTF8 otherwise?
>
> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
> file starting with: "ï»¿" ?
>
Fine, but since I told Clisp explicitly the file is in UTF8 I expected
it to skip BOM as it did with UTF16LE.
-pekka-
> * Pekka Niiranen <··············@jynaznvy.pbz> [2007-01-14 17:16:18 +0200]:
>
> Pascal Bourguignon wrote:
>> Pekka Niiranen <··············@wlanmail.com> writes:
>>
>>> Alex Mizrahi wrote:
>>>> (message (Hello 'Pekka)
>>>> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>>> (
>>>>
>>>> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>>
>>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>>> It is, but BOM is useful for the editors. How can one distinguish
>>> between latin1 and UTF8 otherwise?
>>
>> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
>> file starting with: "ï»¿" ?
>>
> Fine, but since I told Clisp explicitly the file is in UTF8 I expected
> it to skip BOM as it did with UTF16LE.
You did not tell it to skip anything.
You told CLISP that the file is UTF-8.
CLISP does not support BOM.
--
Sam Steingold (http://sds.podval.org/) on Fedora Core release 6 (Zod)
http://thereligionofpeace.com http://dhimmi.com http://memri.org
http://camera.org http://mideasttruth.com http://pmw.org.il http://ffii.org
Professionalism is being dispassionate about your work.
On Sun, 14 Jan 2007 17:16:18 +0200, Pekka Niiranen wrote:
> Pascal Bourguignon wrote:
>> Pekka Niiranen <··············@wlanmail.com> writes:
>>
>>> Alex Mizrahi wrote:
>>>> (message (Hello 'Pekka)
>>>> (you :wrote :on '(Sun, 14 Jan 2007 16:00:19 +0200))
>>>> (
>>>>
>>>> PN> All fine. But when I save the file in "UTF-8" with BOM I get this:
>>>>
>>>> i wonder what is UTF-8 with BOM -- ain't UTF-8 endianness-neutral?
>>> It is, but BOM is useful for the editors. How can one distinguish
>>> between latin1 and UTF8 otherwise?
>>
>> How do you distinguish between a UTF-8 with BOM file and a ISO-8859-1
>> file starting with: "ï»¿" ?
>>
> Fine, but since I told Clisp explicitly the file is in UTF8 I expected
> it to skip BOM as it did with UTF16LE.
>
From the Unicode book (chapter 13.6, p. 13):
Where the character set information is explicitly marked, such as
in UTF-16BE or UTF-16LE, then all U+FEFF characters, even at the very
beginning of the text, are to be interpreted as zero width no-break
spaces. Similarly, where Unicode text has known byte order, initial
U+FEFF characters are also not required and are to be interpreted as
zero width no-break spaces.
HTH RalfD
> -pekka-
Pekka Niiranen wrote:
> How can I read "UTF-8" file so that BOM is not processed IF it exists?
Just remove it yourself before reading the first line:
;; EOF-ERROR-P is NIL so that an empty file does not signal an error here.
(when (eql (peek-char nil stream nil nil) (code-char #xFEFF))
  (read-char stream))
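The same idea can be wrapped in a small helper so it is easy to reuse. What follows is a sketch under assumptions rather than a tested recipe for CLISP: skip-bom and read-first-line are made-up names, the stream is assumed to be a character stream already opened with a UTF-8 external format, and the implementation is assumed to represent U+FEFF as a character.

```lisp
(defun skip-bom (stream)
  "Consume a leading U+FEFF from the character stream STREAM, if present.
Returns T when a character was skipped, NIL otherwise."
  (let ((bom (code-char #xFEFF)))   ; NIL on implementations without Unicode
    (when (and bom
               (eql (peek-char nil stream nil nil) bom))  ; NIL at end of file
      (read-char stream)
      t)))

(defun read-first-line (stream)
  "Skip an optional BOM, then read one line; returns NIL on an empty stream."
  (skip-bom stream)
  (read-line stream nil nil))

;; Hypothetical usage, with a string stream standing in for the file:
(with-input-from-string (in (format nil "~CFirst line~%Sec line~%"
                                    (code-char #xFEFF)))
  (read-first-line in))
```

With a real file, the call would sit inside the with-open-file form from the original question, right after the stream is opened and before the first read-line.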