From: Robert Dodier
Subject: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164405600.649501.117240@45g2000cws.googlegroups.com>
Hello,

I want to read bytes from a file containing UTF-8 characters and
encode them into a string. Specifically, I have the byte offset of
the beginning of the string and the number of bytes in the string
(always a whole number of characters), so I am planning to seek
to the beginning, read so-many bytes, then encode the result into
a string.

I have browsed the CLHS, comp.lang.lisp archives, Seibel's PCL,
and random web pages without coming up with a solution.

One thing that seems promising is CODE-CHAR.
Can I make it recognize UTF-8 codes?

One last thing -- the solution needs to be CL; I'm not in a position
to choose a Lisp implementation.

Thanks in advance for any light you can shed on this question.

Robert Dodier

From: Graham Fawcett
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164416003.544078.133170@j44g2000cwa.googlegroups.com>
Robert Dodier wrote:
> Hello,
>
> I want to read bytes from a file containing UTF-8 characters and
> encode them into a string. [snip] I'm not in a position
> to choose a Lisp implementation.

If your implementation doesn't provide Unicode support, how about IBM's
International Components for Unicode (ICU) library? I've never used it;
perhaps someone else here has. See

http://icu.sourceforge.net/userguide/strings.html

for examples. It might require a bit of FFI, but I imagine it's quite a
solid library.

Graham
From: ··············@setf.de
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164486217.583911.299680@j72g2000cwa.googlegroups.com>
cl-xml includes a portable unicode implementation.
From: Pascal Bourguignon
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <87ac2gii3m.fsf@thalassa.informatimago.com>
"Robert Dodier" <·············@gmail.com> writes:
> I want to read bytes from a file containing UTF-8 characters and
> encode them into a string. Specifically, I have the byte offset of
> the beginning of the string and the number of bytes in the string
> (always a whole number of characters), so I am planning to seek
> to the beginning, read so-many bytes, then encode the result into
> a string.

You've got your concepts all wrong.  That can't work.


The file doesn't contain characters.  Not on a POSIX system.  Files in
POSIX (including Unix and MS-Windows) don't contain characters. They
contain only bytes.

These bytes may encode a string of characters using the utf-8 unicode
coding system.  But you'll have to read bytes.


> I have browsed the CLHS, comp.lang.lisp archives, Seibel's PCL,
> and random web pages without coming up with a solution.
>
> One thing that seems promising is CODE-CHAR.

It's not promising at all.  There's absolutely no guarantee of what
CODE-CHAR does, with respect to utf-8 or unicode.


> Can I make it recognize UTF-8 codes?

No.


> One last thing -- the solution needs to be CL; I'm not in a position
> to choose a Lisp implementation.

Ouch!
You'll have to implement UTF-8 to Unicode decoding, and Unicode code
point to character conversion.

Since you cannot choose a Lisp implementation, you can count only on
the standard characters:

  #\NEWLINE  #\SPACE 
  #\!  #\"  #\#  #\$  #\%  #\&  #\'  #\(  #\)  #\*  #\+  #\,  #\-  #\.  #\/  
  #\0  #\1  #\2  #\3  #\4  #\5  #\6  #\7  #\8  #\9  #\:  #\;  #\<  #\=  #\>  #\?  
  #\@  #\A  #\B  #\C  #\D  #\E  #\F  #\G  #\H  #\I  #\J  #\K  #\L  #\M  #\N  #\O  
  #\P  #\Q  #\R  #\S  #\T  #\U  #\V  #\W  #\X  #\Y  #\Z  #\[  #\\  #\]  #\^  #\_  
  #\`  #\a  #\b  #\c  #\d  #\e  #\f  #\g  #\h  #\i  #\j  #\k  #\l  #\m  #\n  #\o  
  #\p  #\q  #\r  #\s  #\t  #\u  #\v  #\w  #\x  #\y  #\z  #\{  #\|  #\}  #\~  
                       
that's all. So in pure CL, independent of any implementation, you'll be
able to decode UTF-8 to Unicode, but you can map only the code points
between 32 and 126, plus newline, to these characters.

Since in a UTF-8 stream the bytes less than 128 encode exactly these
characters, and only these characters are encoded as single bytes less
than 128, you could actually skip the UTF-8 decoding, just signaling an
error on any byte greater than or equal to 128.
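
A minimal sketch of that fallback (the helper names are made up; it
maps bytes through a literal table of the standard graphic characters,
so nothing is assumed about the implementation's CODE-CHAR, and it
assumes newline is encoded as LF, byte 10):

    (defparameter *ascii-graphic*
      (concatenate 'string
                   " !\"#$%&'()*+,-./0123456789:;<=>?"
                   "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
                   "`abcdefghijklmnopqrstuvwxyz{|}~")
      "The 95 standard graphic characters, indexed by code minus 32.")

    (defun ascii-byte-char (byte)
      (cond ((= byte 10) #\Newline)                 ; LF
            ((<= 32 byte 126) (char *ascii-graphic* (- byte 32)))
            (t (error "Byte ~D would need real UTF-8 decoding." byte))))

    (defun ascii-bytes-to-string (bytes)
      (map 'string #'ascii-byte-char bytes))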


> Thanks in advance for any light you can shed on this question.

If I were you, I'd try to get to use either sbcl or clisp (or both),
open the file as a binary file with :element-type '(unsigned-byte 8),
seek to the _byte_ offset you're given, then use #+sbcl
sb-ext:octets-to-string or #+clisp ext:convert-string-from-bytes to
DECODE the bytes and get a string of unicode characters.

http://www.cliki.net/CloserLookAtCharacters
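
A minimal sketch of that approach (the function name is made up, and
the decoding call is implementation-specific, as noted above):

    (defun read-utf-8-string (pathname offset length)
      ;; Read LENGTH raw bytes starting at byte OFFSET, then decode.
      (with-open-file (in pathname :element-type '(unsigned-byte 8))
        (let ((bytes (make-array length
                                 :element-type '(unsigned-byte 8))))
          (file-position in offset)
          (read-sequence bytes in)
          #+sbcl (sb-ext:octets-to-string bytes :external-format :utf-8)
          #+clisp (ext:convert-string-from-bytes bytes charset:utf-8))))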


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
In deep sleep hear sound,
Cat vomit hairball somewhere.
Will find in morning.
From: llothar
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164502315.178945.325640@j72g2000cwa.googlegroups.com>
Pascal Bourguignon wrote:

> You've got your concepts all wrong.  That can't work.
>
>
> The file doesn't contain characters.  Not on a POSIX system.  Files in
> POSIX (including Unix and MS-Windows) don't contain characters. They
> contain only bytes.

One more guy who absolutely doesn't understand the basic concept of UTF-8.

In UTF-8, a character is a byte, unless you decide that in a special
function you need to care about non-ASCII characters; then you start to
look at multibyte sequences.
From: Graham Fawcett
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164503006.232705.82190@l12g2000cwl.googlegroups.com>
llothar wrote:
> One more guy who absolutely doesn't understand the basic concept of UTF-8.
> In UTF-8, a character is a byte, unless you decide that in a special
> function you need to care about non-ASCII characters; then you start to
> look at multibyte sequences.

Right. And Unicode is a subset of ASCII, unless you decide that you
need to care about non-uppercase Roman letters.

In UTF-8, each character is encoded as a sequence of 1 to 4 octets.
Period. "Deciding what you care about" and "special functions" have
nothing to do with it. 
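
For reference, the octet count depends only on the code point's
magnitude; a quick sketch (the function name is made up):

    (defun utf-8-length (code-point)
      ;; Number of octets UTF-8 uses to encode one code point.
      (cond ((< code-point #x80)    1)   ; the ASCII range
            ((< code-point #x800)   2)
            ((< code-point #x10000) 3)
            (t                      4)))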

--G
From: Harald Hanche-Olsen
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <pco64d25xk7.fsf@shuttle.math.ntnu.no>
+ "Graham Fawcett" <··············@gmail.com>:

| In UTF-8, each character is encoded as a sequence of 1 to 4
| octets. Period. "Deciding what you care about" and "special
| functions" have nothing to do with it.

Indeed.  But if we are to bend over backwards to try to assign some
meaning to llothar's point, maybe it is this: Once in a while, an
application doesn't need to look inside UTF-8 text other than in order
to recognize a few ASCII characters in it.  And then, since the ASCII
characters look like what you would expect in non-Unicode text, and
nothing else in a Unicode byte string looks like an ASCII character,
your application can get away with not knowing that it does, in fact,
deal with Unicode.

I have successfully written UTF-8 encoded HTML files from CMUCL using
this method, just relying on CMUCL's 8-bit characters and being
agnostic about the non-ASCII portion of it.

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: ······@gmail.com
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164547965.650262.121170@45g2000cws.googlegroups.com>
Harald Hanche-Olsen wrote:
> But if we are to bend over backwards to try to assign some
> meaning to llothar's point

Why would you want to do that ("to bend over backwards to try to
assign...")? To encourage further senseless postings of ignorant jerks
with the newfound hope that even when they post complete rubbish
someone would find a "secret meaning" in it?
From: Harald Hanche-Olsen
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <pcomz6e3ujt.fsf@shuttle.math.ntnu.no>
+ ······@gmail.com:

| Harald Hanche-Olsen wrote:
|> But if we are to bend over backwards to try to assign some
|> meaning to llothar's point
|
| Why would you want to do that ("to bend over backwards to try to
| assign...")? To encourage further senseless postings of ignorant jerks
| with the newfound hope that even when they post complete rubbish
| someone would find a "secret meaning" in it?

The point I wished to make was in fact rather tangential to what the
ignorant jerk (and I agree with your assessment there) was trying to
say, so I could indeed have skipped the "bend over backwards" bit.
No, it is not a good idea to encourage such behaviour.  May I go now?

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: llothar
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164650851.793042.309630@45g2000cws.googlegroups.com>
······@gmail.com wrote:

> Harald Hanche-Olsen wrote:
> > But if we are to bend over backwards to try to assign some
> > meaning to llothar's point
>
> Why would you want to do that ("to bend over backwards to try to
> assign...")? To encourage further senseless postings of ignorant jerks
> with the newfound hope that even when they post complete rubbish
> someone would find a "secret meaning" in it?

Because this is the reason why people started using UTF-8 and why it is
now the basis of all modern operating systems (the ones designed after
1992 - which does not include the WinNT branch).

It combines simplicity, fast parsing, very few backward compatibility
problems, memory efficiency and easy processing rules. I never
understood why I should live with the UTF-16 or UTF-32 encoding
penalty in my application when there is such an easy way to handle it.


I maintain an IDE and an editor, so I know very well how few cases need
special treatment.
In my opinion much less than 5% of an app depends on multibyte issues.

Unfortunately these days making balanced compromises seems to be hard
for a lot of programmers to accept. Explaining the genius of UTF-8
is hard unless you start coding with it. If you ever had to hack around
the Unicode problems in Windows (anybody heard about surrogate
characters?) you appreciate it and wish there was a UTF-8 layer.
From: ······@gmail.com
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164811913.868792.41130@j72g2000cwa.googlegroups.com>
llothar wrote:
> Because this is the reason why people started using UTF-8
First of all, this has nothing to do with your original post. Going
back, it's clear that you've mixed the concepts of characters, encodings
and application programming, and on top of that accused another person
of not grasping the concept.

> It combines simplicity, fast parsing, very few backward compatibility
> problems, memory efficiency and easy processing rules. I never
> understood why I should live with the UTF-16 or UTF-32 encoding
> penalty in my application when there is such an easy way to handle it.
>
The conclusion to be drawn here is that you have the idea that "character"
equals "byte" very deeply ingrained in your mind, that you are
considering everything from that point of view, and that you are judging
things by how easily they allow you to *not* adjust that mental model.
Also, you're still mixing some concepts; for example, you're praising
UTF-8 for memory efficiency, but in fact that only applies when the
relative frequency of ASCII characters in your text is high enough.
Basically you're using UTF-8 as an ad hoc compression scheme based on
your perceived character frequencies. Even if this fits your application
domain it is still wrong for others, and probably suboptimal for yours.

>
> I maintain an IDE and an editor, so I know very well how few cases need
> special treatment.
> In my opinion much less than 5% of an app depends on multibyte issues.
Ever considered that when using, for example, a 4-byte encoding for every
character, only 0% of an app "depends on multibyte issues"?

> Explaining the genius of UTF-8
> is hard unless you start coding with it.
Oh, that's very easy. Just start your explanation with the following:
"Imagine that you have a weird notion that a character *must* be an
8-bit byte...."
From: llothar
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164844662.765029.270020@h54g2000cwb.googlegroups.com>
······@gmail.com wrote: A message from another planet.

You are right if you are one of the introverted lispers hanging around
here who don't have programs that interact a lot with other
infrastructure, especially libraries, system calls, FFI etc.

Unfortunately the world started with 1 char = 1 byte and that's where we
continue, and your ideal thing will never happen, because now we have
UTF-8 as the basis for all Unix system calls and therefore more or less
in the whole programming world.
From: ······@gmail.com
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164886456.594347.326050@80g2000cwy.googlegroups.com>
llothar wrote:
> ······@gmail.com wrote: A message from another planet.
>
> You are right if you are one of the introverted lispers hanging around
> here who don't have programs that interact a lot with other
> infrastructure, especially libraries, system calls, FFI etc.

Ok, so now we got your claims down to "UTF-8 may be useful to interact
with existing legacy infrastructure". This is a much more agreeable
statement.

>
> Unfortunately the world started with 1 char = 1 byte and that's where we
> continue
Leaving aside an amusing theory of how the world started (I believe the
classic goes something like "In the beginning there was the Word, and
Word had two Bytes and there was nothing else." :) ...

> and your ideal thing will never happen, because now we have
Consider upgrading your world to the 21st century.
From: Harald Hanche-Olsen
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <pco4pshgf2e.fsf@shuttle.math.ntnu.no>
+ ······@gmail.com:

| Leaving aside an amusing theory of how the world started (I believe
| the classic goes something like "In the beginning there was the
| Word, and Word had two Bytes and there was nothing else." :) ...

Bytes?  I am not at all sure when bytes entered the picture.
Certainly, when I was first introduced to computing, the Word was 24
bits long.  We never heard about bytes, but maybe they were too
esoteric for mere undergraduates.  The character set used 6 bits per
character.  And the machine in question was a CDC 3300.  It had real
core memory, and a whopping 15 bit address space.

  http://en.wikipedia.org/wiki/CDC_3000

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: Rob Warnock
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <m8OdnbZqnPd0U_LYnZ2dnUVZ_oOdnZ2d@speakeasy.net>
Harald Hanche-Olsen  <······@math.ntnu.no> wrote:
+---------------
| + ······@gmail.com:
| | Leaving aside an amusing theory of how the world started (I believe
| | the classic goes something like "In the beginning there was the
| | Word, and Word had two Bytes and there was nothing else." :) ...
| 
| Bytes?  I am not at all sure when bytes entered the picture.
| Certainly, when I was first introduced to computing, the Word was 24
| bits long.  We never heard about bytes, but maybe they were too
| esoteric for mere undergraduates.
+---------------

Maybe. The DEC PDP-10 <http://en.wikipedia.org/wiki/PDP-10> had
hardware support for *variable-width* bytes!! [In fact, the ANSI
Common Lisp LDB & DPB functions are named after the corresponding
PDP-10 instructions.] The base machine word was 36 bits, and a byte
could be anything between 0 & 36 bits wide. Byte operations had to
indirect through a byte-pointer word which contained the byte's
position-within-word (P) [bits from the *right* of the word],
width-in-bits (S), and word address (Y), with the usual PDP-10
indirection (I) & indexing (X) allowed on the word address.
[A zero-width byte always gave you a zero result for LDB, and was
a no-op for DPB. I've used that trick to good effect on occasion!]
A byte pointer looked like this:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     P     |     S     | |I|   X   |                Y                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 0         5 6        11  13 14   17 18                               35

The ILDB and IDPB instructions incremented (destructively modified)
the byte-pointer word first before doing the LDB or DPB function, so
successive ILDB/IDPB instructions would step through successive bytes
in memory of whatever size was specified in the byte pointer.

    (defun increment-byte-pointer (bp)
      ;; P counts bits from the right, so stepping rightward to the
      ;; next byte means subtracting the byte size S from P.
      (when (minusp (decf (byte-pointer-p bp) (byte-pointer-s bp)))
        ;; Ran off the right end of the word: wrap to the leftmost
        ;; byte position of the next word.
        (setf (byte-pointer-p bp) (- 36 (byte-pointer-s bp)))
        (incf (byte-pointer-y bp)))
      bp)

Bytes could not be split across words. Normal character strings were
7-bit ASCII (upper/lower/etc.), which meant you got 5 characters per
36-bit word, with one bit wasted.[1] It was common to initialize byte
pointers for text strings to P=36 S=7 Y=<string_addr>, that is, pointing
to the non-existent byte to the left of the first word of the string,
so that the first ILDB would step into the first actual character of
the string and successive ILDBs would continue onwards from there.

[Also see <http://pdp10.nocrew.org/docs/instruction-set/Byte.html>.]

System calls & filenames, however, used the SIXBIT character set,
with six 6-bit characters per word. SIXBIT only allowed upper-case
letters, which is why PDP-10 filenames (and later, MS-DOS, which
copied it) were only uppercase.

    > (defun ascii-char-sixbit (x)
	(let ((c (char-code x)))
	  (if (and (>= c 32) (<= c 95))
	    (logxor 32 (logand 63 c))
	    (error "Character '~c' (~d) cannot be converted to SIXBIT." x c))))

    ASCII-CHAR-SIXBIT
    > (map 'list 'ascii-char-sixbit "FOOBAR.BAZ")

    (38 47 47 34 33 50 14 34 33 58)
    > (map 'list 'ascii-char-sixbit "hello")
    Error: 
    Error in function ASCII-CHAR-SIXBIT:
       Character 'h' (104) cannot be converted to SIXBIT.
    ...

And ISTR that somebody built a C compiler for the PDP-10 that
used 9-bit bytes for characters, so you packed four to a word,
but I'm not entirely sure about that.

Anyway, 6 & 7 were the most common byte sizes used in the PDP-10,
though other sizes were routinely used for various specialized
functions [e.g., lexical parsing tables could make very good
use of the indexing in byte pointers to store "byte-strips"
of various sizes indexed by character (Google for me & FOCAL)].


-Rob

[1] Well, some text editors used a hack of tagging magic "line
    number" words (containing 5 decimal digits in ASCII) with a
    "1" in bit 35, to make it easier(?) to find the next line.
    But except for that, bit 35 was usually zero (wasted) in
    ASCII strings.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Lars Brinkhoff
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <8564cw84qt.fsf@junk.nocrew.org>
····@rpw3.org (Rob Warnock) writes:
> And ISTR that somebody built a C compiler for the PDP-10 that used
> 9-bit bytes for characters, so you packed four to a word, but I'm
> not entirely sure about that.

I believe KCC supports characters of size 6, 7, 8, and 9 bits.  I know
my GCC port does.
From: Pascal Bourguignon
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <87zmaee18v.fsf@thalassa.informatimago.com>
Harald Hanche-Olsen <······@math.ntnu.no> writes:

> + "Graham Fawcett" <··············@gmail.com>:
>
> | In UTF-8, each character is encoded as a sequence of 1 to 4
> | octets. Period. "Deciding what you care about" and "special
> | functions" have nothing to do with it.
>
> Indeed.  But if we are to bend over backwards to try to assign some
> meaning to llothar's point, maybe it is this: Once in a while, an
> application doesn't need to look inside UTF-8 text other than in order
> to recognize a few ASCII characters in it.  And then, since the ASCII
> characters look like what you would expect in non-Unicode text, and
> nothing else in a Unicode byte string looks like an ASCII character,

It doesn't.  ASCII only uses 7 bits.   At best, it could look like an
ISO-8859-1 encoded character string (but with a lot of control
codes...).


> your application can get away with not knowing that it does, in fact,
> deal with Unicode.
>
> I have successfully written UTF-8 encoded HTML files from CMUCL using
> this method, just relying on CMUCL's 8-bit characters and being
> agnostic about the non-ASCII portion of it.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
        A stray cat
relieves itself
        in the winter garden
                                        Shiki
From: Harald Hanche-Olsen
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <pcor6vq3us7.fsf@shuttle.math.ntnu.no>
+ Pascal Bourguignon <···@informatimago.com>:

| Harald Hanche-Olsen <······@math.ntnu.no> writes:
|
|> And then, since the ASCII
|> characters look like what you would expect in non-Unicode text, and
|> nothing else in a Unicode byte string looks like an ASCII character,
|
| It doesn't.  ASCII only uses 7 bits.

I know that, but I find the common practice of saying ASCII about
stuff that is encoded in 8-bit bytes with the top bit set to zero and
the remaining bits interpreted as true ASCII to be quite harmless and
indeed useful, though it is strictly speaking incorrect.

| At best, it could look like an ISO-8859-1 encoded character string
| (but with a lot of control codes...).

Or ISO-8859-x for various values of x, or any number of other
character sets (mac, dos, windows ...).  My point is that sometimes
your application doesn't need to care about the meaning of
"characters" with codes in the 128-255 range.  But using this in
programs is /at best/ a stopgap measure until everything is unicode
aware, so it had better be used carefully and sparingly.

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: Pascal Bourguignon
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <873b86dl3w.fsf@thalassa.informatimago.com>
Harald Hanche-Olsen <······@math.ntnu.no> writes:

> + Pascal Bourguignon <···@informatimago.com>:
>
> | Harald Hanche-Olsen <······@math.ntnu.no> writes:
> |
> |> And then, since the ASCII
> |> characters look like what you would expect in non-Unicode text, and
> |> nothing else in a Unicode byte string looks like an ASCII character,
> |
> | It doesn't.  ASCII only uses 7 bits.
>
> I know that, but I find the common practice of saying ASCII about
> stuff that is encoded in 8-bit bytes with the top bit set to zero and
> the remaining bits interpreted as true ASCII to be quite harmless and
> indeed useful, though it is strictly speaking incorrect.

Perhaps, but we're not discussing this.  We're discussing a stream of
bytes where the 8th bit is set for some bytes (since it's actually a
UTF-8 stream).  So it's not harmless at all to make the error of
naming it an ASCII stream.


> | At best, it could look like an ISO-8859-1 encoded character string
> | (but with a lot of control codes...).
>
> Or ISO-8859-x for various values of x, or any number of other
> character sets (mac, dos, windows ...).  My point is that sometimes
> your application doesn't need to care about the meaning of
> "characters" with codes in the 128-255 range.  


If that's the case, then use (unsigned-byte 8) instead of character as
element-type.
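
E.g. (the filename is a placeholder):

    (with-open-file (in "page.html" :element-type '(unsigned-byte 8))
      (let ((octets (make-array (file-length in)
                                :element-type '(unsigned-byte 8))))
        (read-sequence octets in)
        octets))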


> But using this in programs is /at best/ a stopgap measure until
> everything is unicode aware, so it had better be used carefully and
> sparingly.


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

HANDLE WITH EXTREME CARE: This product contains minute electrically
charged particles moving at velocities in excess of five hundred
million miles per hour.
From: Harald Hanche-Olsen
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <pcor6vp7s6z.fsf@shuttle.math.ntnu.no>
+ Pascal Bourguignon <···@informatimago.com>:

| Perhaps, but we're not discussing this.  We're discussing a stream of
| bytes where the 8th bit is set for some bytes (since it's actually a
| UTF-8 stream).  So it's not harmless at all to make the error of
| naming it an ASCII stream.

I never did suggest that.  If you think I did, I must have chosen my
wording poorly.

|> My point is that sometimes
|> your application doesn't need to care about the meaning of
|> "characters" with codes in the 128-255 range.  
|
| If that's the case, then use (unsigned-byte 8) instead of character
| as element-type.

Yes, that is certainly more portable (the other way, though useful, is
not at all portable), but could be more cumbersome: Say, if you wish
to parse *ML in which all the markup (excluding attribute values) uses
ASCII characters, while contents may not.

(I notice that the example code for using mod_lisp from cmucl appears
to rely on this hack, so it's not like it is my own invention.)
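
E.g., searching for markup delimiters by byte value stays safe in
UTF-8, because octets 60 (#\<) and 62 (#\>) can never occur inside a
multibyte sequence (a sketch over an octet buffer):

    (let ((open (position 60 octets)))       ; 60 = ASCII code of #\<
      (when open
        (position 62 octets :start open)))   ; 62 = ASCII code of #\>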

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell
From: Pascal Bourguignon
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <87y7pxdblu.fsf@thalassa.informatimago.com>
Harald Hanche-Olsen <······@math.ntnu.no> writes:
> |> My point is that sometimes
> |> your application doesn't need to care about the meaning of
> |> "characters" with codes in the 128-255 range.  
> |
> | If that's the case, then use (unsigned-byte 8) instead of character
> | as element-type.
>
> Yes, that is certainly more portable (the other way, though useful, is
> not at all portable), but could be more cumbersome: Say, if you wish
> to parse *ML in which all the markup (excluding attribute values) uses
> ASCII characters, while contents may not.

And I should mention that, apart from the reader syntax "blah", there
are very few string-specific functions in CL.  Most of the string
processing one can do in Common Lisp is actually done with sequence or
vector functions.


So using a special reader macro, or just something like:

      #.(ascii-bytes "literal bytes")

to read literal byte vectors in sources, and implementing:

ascii-bytes= ascii-bytes/= 
ascii-bytes< ascii-bytes<= 
ascii-bytes> ascii-bytes>=
ascii-bytes-equal ascii-bytes-not-equal 
ascii-bytes-lessp ascii-bytes-not-lessp
ascii-bytes-greaterp ascii-bytes-not-greaterp
ascii-bytes-trim 
ascii-bytes-left-trim ascii-bytes-right-trim

would be all that is required to avoid the overhead of CHARACTER
(and get the same performance and dumbness as C).
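
A sketch of two of those (CHAR-CODE is not guaranteed to return ASCII
codes, so a fully portable version would go through a translation
table; this assumes an ASCII-based implementation):

    (defun ascii-bytes (string)
      (map '(vector (unsigned-byte 8)) #'char-code string))

    (defun ascii-bytes= (a b)
      (and (= (length a) (length b))
           (every #'= a b)))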

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"This machine is a piece of GAGH!  I need dual Opteron 850
processors if I am to do battle with this code!"
From: Pisin Bootvong
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164895970.599627.144440@n67g2000cwd.googlegroups.com>
Harald Hanche-Olsen wrote:
> + "Graham Fawcett" <··············@gmail.com>:
>
> | In UTF-8, each character is encoded as a sequence of 1 to 4
> | octets. Period. "Deciding what you care about" and "special
> | functions" have nothing to do with it.
>
> Indeed.  But if we are to bend over backwards to try to assign some
> meaning to llothar's point, maybe it is this: Once in a while, an
> application doesn't need to look inside UTF-8 text other than in order
> to recognize a few ASCII characters in it.  And then, since the ASCII
> characters look like what you would expect in non-Unicode text, and
> nothing else in a Unicode byte string looks like an ASCII character,
> your application can get away with not knowing that it does, in fact,
> deal with Unicode.
>

No; for example, count all '0' (0x30) characters in a UTF-8 file using
an ASCII code search.

If you search for the ASCII character '0' (0x30) in Unicode text,
you may mistakenly match some Unicode character which is two bytes
long, for example \uAA30, where the first byte is not valid ASCII but
the second byte is '0'.

Not every character in Unicode looks like an ASCII character, but one
byte of it may look like an ASCII byte.

> I have successfully written UTF-8 encoded HTML files from CMUCL using
> this method, just relying on CMUCL's 8-bit characters and being
> agnostic about the non-ASCII portion of it.
>
> --
> * Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
> - It is undesirable to believe a proposition
>   when there is no ground whatsoever for supposing it is true.
>   -- Bertrand Russell
From: Zach Beane
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <m3ac29c8qy.fsf@unnamed.xach.com>
"Pisin Bootvong" <··········@gmail.com> writes:

> No; for example, count all '0' (0x30) characters in a UTF-8 file using
> an ASCII code search.
> 
> If you search for the ASCII character '0' (0x30) in Unicode text,
> you may mistakenly match some Unicode character which is two bytes
> long, for example \uAA30, where the first byte is not valid ASCII but
> the second byte is '0'.
> 
> Not every character in Unicode looks like an ASCII character, but one
> byte of it may look like an ASCII byte.

Incorrect. The UTF-8 encoding is designed to avoid this situation. No
octet used in encoding U+AA30 corresponds to any ASCII character.

Zach
From: Wolfram Fenske
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164929937.248290.62830@79g2000cws.googlegroups.com>
Zach Beane <····@xach.com> writes:

> "Pisin Bootvong" <··········@gmail.com> writes:
>
>> No; for example, count all '0' (0x30) characters in a UTF-8 file using
>> an ASCII code search.
>>
>> If you search for the ASCII character '0' (0x30) in Unicode text,
>> you may mistakenly match some Unicode character which is two bytes
>> long, for example \uAA30, where the first byte is not valid ASCII but
>> the second byte is '0'.

No.  In a UTF-8 stream the code point U+AA30 would not be encoded as
the byte sequence 0xAA 0x30 but as something else (I'm too lazy to
figure it out right now.  You can look up the algorithm on Wikipedia
if you like.).

>> Not every character in Unicode looks like an ASCII character, but one
>> byte of it may look like an ASCII byte.
>
> Incorrect. The UTF-8 encoding is designed to avoid this situation. No
> octet used in encoding U+AA30 corresponds to any ASCII character.

Right.  Every byte of a multi-byte sequence has its highest bit set.
This way it cannot be confused with an ASCII character.  The only
bytes in a UTF-8 stream that do look like ASCII characters
(i.e. bytes whose highest bit is not set) are in fact ASCII
characters.  That and the fact that UTF-8 streams don't contain
zero-bytes [1] is what makes UTF-8 so attractive for use in legacy
systems.
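
To make that concrete, here is a sketch of the three-octet encoding
rule, showing what U+AA30 actually becomes (the function name is made
up):

    (defun utf-8-encode-3 (code-point)
      ;; Encode a code point in the #x0800-#xFFFF range as UTF-8:
      ;; 1110xxxx 10xxxxxx 10xxxxxx.
      (list (logior #xE0 (ldb (byte 4 12) code-point))
            (logior #x80 (ldb (byte 6 6) code-point))
            (logior #x80 (ldb (byte 6 0) code-point))))

    ;; (utf-8-encode-3 #xAA30) => (234 168 176), i.e. #xEA #xA8 #xB0.
    ;; Every octet has its high bit set, so none can be mistaken for
    ;; ASCII '0' (#x30).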


Footnotes:
[1]  unlike other Unicode encodings

--
Wolfram Fenske

A: Yes.
>Q: Are you sure?
>>A: Because it reverses the logical flow of conversation.
>>>Q: Why is top posting frowned upon?
From: Robert Dodier
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <1164587708.488030.276350@l39g2000cwd.googlegroups.com>
Pascal Bourguignon wrote:

> "Robert Dodier" <·············@gmail.com> writes:
> > I want to read bytes from a file containing UTF-8 characters and
> > encode them into a string. Specifically, I have the byte offset of
> > the beginning of the string and the number of bytes in the string
> > (always a whole number of characters), so I am planning to seek
> > to the beginning, read so-many bytes, then encode the result into
> > a string.
>
> You've got your concepts all wrong.  That can't work.

I knew someone would give me the traditional greeting:
"Welcome to comp.lang.lisp. You're an idiot."

> The file doesn't contain characters.  Not on a POSIX system.  Files in
> POSIX (including Unix and MS-Windows) don't contain characters. They
> contain only bytes.

Quibbling is doubtless a waste of time here, but: a file contains
whatever I want it to contain, and, by Odin's beard, if I want it
to contain characters, then so it is.

> These bytes may encode a string of characters using the utf-8 unicode
> coding system.  But you'll have to read bytes.

I was thinking something like "I want to read bytes" and "I am
planning to seek to the beginning, [and] read so-many bytes";
but I guess I'll have to read bytes instead.

Robert Dodier
From: Pascal Bourguignon
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <87u00lcyo3.fsf@thalassa.informatimago.com>
"Robert Dodier" <·············@gmail.com> writes:

> Pascal Bourguignon wrote:
>
>> "Robert Dodier" <·············@gmail.com> writes:
>> > I want to read bytes from a file containing UTF-8 characters and
>> > encode them into a string. Specifically, I have the byte offset of
>> > the beginning of the string and the number of bytes in the string
>> > (always a whole number of characters), so I am planning to seek
>> > to the beginning, read so-many bytes, then encode the result into
>> > a string.
>>
>> You've got your concepts all wrong.  That can't work.
>
> I knew someone would give me the traditional greeting:
> "Welcome to comp.lang.lisp. You're an idiot."

This is not what I wrote.


>> The file doesn't contain characters.  Not on a POSIX system.  Files in
>> POSIX (including Unix and MS-Windows) don't contain characters. They
>> contain only bytes.
>
> Quibbling is doubtless a waste of time here, but: a file contains
> whatever I want it to contain, and, by Odin's beard, if I want it
> to contain characters, then so it is.

But now I know this is what I should have written.



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

ATTENTION: Despite any other listing of product contents found
herein, the consumer is advised that, in actuality, this product
consists of 99.9999999999% empty space.
From: Rafal Strzalinski
Subject: Re: Encoding bytes into UTF-8 string
Date: 
Message-ID: <72a3f$45677de4$d4ba8d22$15182@news.chello.pl>
Robert Dodier wrote:
> Hello,
> 
> I want to read bytes from a file containing UTF-8 characters and
> encode them into a string. Specifically, I have the byte offset of
> the beginning of the string and the number of bytes in the string
> (always a whole number of characters), so I am planning to seek
> to the beginning, read so-many bytes, then encode the result into
> a string.

You may read the file using the LATIN-1 encoding (1 byte per character)
and then recode the interesting part using flexi-streams [1]:

(flexi-streams:octets-to-string         ; decode the UTF-8 octets
 (flexi-streams:string-to-octets        ; recover the original bytes
  string :external-format :latin1)
 :external-format :utf-8)
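
For instance, with the byte offset and length from the original
question (LATIN1-TEXT, OFFSET and LENGTH are placeholder names; since
LATIN-1 maps each byte to one character, character positions equal
byte offsets):

    (flexi-streams:octets-to-string
     (flexi-streams:string-to-octets latin1-text
                                     :external-format :latin1
                                     :start offset
                                     :end (+ offset length))
     :external-format :utf-8)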


[1] http://weitz.de/flexi-streams/

-- 
Best regards,
Rafal Strzalinski (nabla)
http://rafal.strzalinski.pl