From: Peter Seibel
Subject: Practical Common Lisp takes apart binary files
Date:
Message-ID: <m3u0v1zt3x.fsf@javamonkey.com>
Chapter 19 in which we write a library for parsing binary files is now
on-line at:
<http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
For those of you who are just joining us, this is a chapter from a
book about Common Lisp that I'm writing for Apress. The main page at
<http://www.gigamonkeys.com/book/>
explains more. Basically I'm looking for feedback from any and all
about just about anything. I'm particularly interested in what bits
you found either confusing or helpful, especially if you are
relatively new to Lisp.
-Peter
--
Peter Seibel ·····@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
Greetings, and thanks for this as always!
One thing I've always missed from C is a convenient Lisp binding to
mmap. Would anyone be interested if I worked up something for GCL?
Take care,
Peter Seibel <·····@javamonkey.com> writes:
> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
>
> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>
> For those of you who are just joining us, this is a chapter from a
> book about Common Lisp that I'm writing for Apress. The main page at
>
> <http://www.gigamonkeys.com/book/>
>
> explains more. Basically I'm looking for feedback from any and all
> about just about anything. I'm particularly interested in what bits
> you found either confusing or helpful, especially if you are
> relatively new to Lisp.
>
> -Peter
>
> --
> Peter Seibel ·····@javamonkey.com
>
> Lisp is the red pill. -- John Fraser, comp.lang.lisp
--
Camm Maguire ····@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah
Camm Maguire wrote:
> Greetings, and thanks for this as always!
>
> One thing I've always missed from C is a convenient Lisp binding to
> mmap. Would anyone be interested if I worked up something for GCL?
I'd be more interested if you worked on making GCL pass more of the
ANSI test suite :)
... just keeping your priorities straight :)
Cheers
--
Marco
Peter Seibel <·····@javamonkey.com> writes:
> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
>
> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
A silly point.
Quoting:
UTF-8 is a popular encoding for Unicode text that consists primarily
of characters from the ASCII subset of Unicode since it encodes
all such characters in a single byte, just as they would be if encoded
in ASCII. However it can also encode any other Unicode character,
using two, three, or even four bytes.
it goes up to six...
Cheers,
mwh
--
ARTHUR: Yes. It was on display in the bottom of a locked filing
cabinet stuck in a disused lavatory with a sign on the door
saying "Beware of the Leopard".
-- The Hitch-Hikers Guide to the Galaxy, Episode 1
From: Peter Seibel
Subject: Re: Practical Common Lisp takes apart binary files
Date:
Message-ID: <m3d61pzpgh.fsf@javamonkey.com>
Michael Hudson <···@python.net> writes:
> Peter Seibel <·····@javamonkey.com> writes:
>
>> Chapter 19 in which we write a library for parsing binary files is now
>> on-line at:
>>
>> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>
> A silly point.
>
> Quoting:
>
> UTF-8 is a popular encoding for Unicode text that consists primarily
> of characters from the ASCII subset of Unicode since it encodes
> all such characters in a single byte, just as they would be if encoded
> in ASCII. However it can also encode any other Unicode character,
> using two, three, or even four bytes.
>
> it goes up to six...
Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of
<http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>
only lists forms up to four bytes. And on the previous page it says,
"Any UTF-8 byte sequence that does not match the patterns listed in
Table 3.6 is ill-formed." But I'm no Unicode expert, so it's easily
possible I'm missing something.
-Peter
--
Peter Seibel ·····@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
Peter Seibel <·····@javamonkey.com> writes:
> But I'm no Unicode expert, so it's easily possible I'm missing
> something.
I too was surprised to see the Unicode document limit UTF-8 to four
bytes, so I did some research and came up with this explanation¹:
In fact, UTF-8 is able to use a sequence of up to six bytes and
cover the whole area 0x00-0x7FFFFFFF (31 bits), but UTF-8 was
restricted by RFC 3629 to only use the area covered by the formal
Unicode definition, 0x00-0x10FFFF, in November 2003. Before this,
only the bytes 0xFE and 0xFF did not occur in a UTF-8 encoded
text. After this limit was introduced, the number of unused bytes in
a UTF-8 stream increased to 13 bytes: 0xC0, 0xC1, 0xF5-0xFF. Even
though this new definition limits the available encoding area
severely, the problem with overlong sequences (different ways of
encoding the same character, which can be a security risk) is
eliminated, because an overlong sequence will contain some of these
bytes that are not used and therefore will not be a valid sequence.
I had written two UTF-8 encoder/decoders around 1999 and recall the
six-byte layout. Now it's shorter but more complicated, with more
special cases. I would have never thought that something as simple as
UTF-8 would be revised without its name being changed.
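The byte count in the quoted explanation is easy to check. The following is a minimal sketch, not code from the thread or from RFC 3629; the names `never-in-utf-8-p` and `count-unused-utf-8-bytes` are made up for illustration. It only counts the byte values listed above and does not validate whole sequences.

```lisp
;; Hypothetical helpers for illustration only.
(defun never-in-utf-8-p (byte)
  "True if BYTE (0-255) is one of the values listed as unused after RFC 3629."
  (or (<= #xC0 byte #xC1)    ; could only start overlong two-byte forms
      (<= #xF5 byte #xFF)))  ; #xF5-#xF7: four-byte leads above #x10FFFF,
                             ; #xF8-#xFD: old five- and six-byte leads,
                             ; #xFE, #xFF: never used

(defun count-unused-utf-8-bytes ()
  "Count the byte values in 0-255 that NEVER-IN-UTF-8-P rejects."
  (loop for byte from 0 below 256
        count (never-in-utf-8-p byte)))
```

Calling `(count-unused-utf-8-bytes)` returns 13, matching the figure in the quoted text. Note that #xF4 is still a legal lead byte, since it is needed for code points up to #x10FFFF.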
Footnotes:
¹ http://www.wordiq.com/definition/UTF-8
http://www.faqs.org/rfcs/rfc3629.html
--
Steven E. Harris
Peter Seibel <·····@javamonkey.com> writes:
> Michael Hudson <···@python.net> writes:
>
> > Peter Seibel <·····@javamonkey.com> writes:
> >
> >> Chapter 19 in which we write a library for parsing binary files is now
> >> on-line at:
> >>
> >> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
> >
> > A silly point.
> >
> > Quoting:
> >
> > UTF-8 is a popular encoding for Unicode text that consists primarily
> > of characters from the ASCII subset of Unicode since it encodes
> > all such characters in a single byte, just as they would be if encoded
> > in ASCII. However it can also encode any other Unicode character,
> > using two, three, or even four bytes.
> >
> > it goes up to six...
>
> Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of
>
> <http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>
>
> only lists forms up to four bytes. And on the previous page it says,
> "Any UTF-8 byte sequence that does not match the patterns listed in
> Table 3.6 is ill-formed." But I'm no Unicode expert, so it's easily
> possible I'm missing something.
Hmm. RFC 2279, "UTF-8, a transformation format of ISO 10646" says:
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
but that's a considerably older document than the Unicode 4 spec, so
maybe I'm out of date.
<reads>
Hmm, in http://www.unicode.org/versions/Unicode4.0.0/appC.pdf I see:
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character
set; those five- and six-byte sequences are illegal for the use of
UTF-8 as an encoding form of Unicode characters.
So I guess you are right and I am wrong, though I'm not as wrong as I
could have been :-)
Cheers,
mwh
--
Hmmm... its Sunday afternoon: I could do my work, or I could do a
Fourier analysis of my computer's fan noise.
-- Amit Muthu, ucam.chat (from Owen Dunn's summary of the year)
From: Peter Seibel
Subject: Re: Practical Common Lisp takes apart binary files
Date:
Message-ID: <m31xi5znlt.fsf@javamonkey.com>
Michael Hudson <···@python.net> writes:
> Peter Seibel <·····@javamonkey.com> writes:
>
>> Michael Hudson <···@python.net> writes:
>>
>> > Peter Seibel <·····@javamonkey.com> writes:
>> >
>> >> Chapter 19 in which we write a library for parsing binary files is now
>> >> on-line at:
>> >>
>> >> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>> >
>> > A silly point.
>> >
>> > Quoting:
>> >
>> > UTF-8 is a popular encoding for Unicode text that consists primarily
>> > of characters from the ASCII subset of Unicode since it encodes
>> > all such characters in a single byte, just as they would be if encoded
>> > in ASCII. However it can also encode any other Unicode character,
>> > using two, three, or even four bytes.
>> >
>> > it goes up to six...
>>
>> Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of
>>
>> <http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>
>>
>> only lists forms up to four bytes. And on the previous page it says,
>> "Any UTF-8 byte sequence that does not match the patterns listed in
>> Table 3.6 is ill-formed." But I'm no Unicode expert, so it's easily
>> possible I'm missing something.
>
> Hmm. RFC 2279, "UTF-8, a transformation format of ISO 10646" says:
>
> In UTF-8, characters are encoded using sequences of 1 to 6 octets.
>
> but that's a considerably older document than the Unicode 4 spec, so
> maybe I'm out of date.
>
> <reads>
>
> Hmm, in http://www.unicode.org/versions/Unicode4.0.0/appC.pdf I see:
>
>
> The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
> allows for the use of five- and six-byte sequences to encode
> characters that are outside the range of the Unicode character
> set; those five- and six-byte sequences are illegal for the use of
> UTF-8 as an encoding form of Unicode characters.
>
> So I guess you are right and I am wrong, though I'm not as wrong as I
> could have been :-)
Yeah, but just wait a while and you'll probably be right again. ;-)
Presumably the reason that only up to four-byte UTF-8 forms are legal
is that that's all that's needed to encode the largest Unicode code
point, 10ffff. But one of these days the Unicode folks will no doubt
feel the need to use 32 bits' worth of code points, at which time,
presumably, UTF-8 will have to be extended to five- and six-byte
encodings.
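The arithmetic behind this can be sketched as follows. This is an illustration, not code from the chapter: `utf-8-octet-count` is a hypothetical name, and the thresholds are the ranges of the original one-to-six-octet scheme described in RFC 2279. It assumes a non-negative integer and ignores the surrogate range and the #x10FFFF ceiling that RFC 3629 imposes.

```lisp
;; Hypothetical helper for illustration only.
(defun utf-8-octet-count (code-point)
  "Octets needed for CODE-POINT under the original 1-to-6 octet UTF-8 scheme."
  (cond ((<= code-point #x7F) 1)         ;  7 payload bits
        ((<= code-point #x7FF) 2)        ; 11 payload bits
        ((<= code-point #xFFFF) 3)       ; 16 payload bits
        ((<= code-point #x1FFFFF) 4)     ; 21 payload bits
        ((<= code-point #x3FFFFFF) 5)    ; 26 payload bits
        ((<= code-point #x7FFFFFFF) 6)   ; 31 payload bits
        (t (error "~X does not fit in 31 bits" code-point))))
```

Here `(utf-8-octet-count #x10FFFF)` is 4, so four octets cover everything Unicode currently assigns, while a value such as #x200000 would need the five-octet form.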
-Peter
--
Peter Seibel ·····@javamonkey.com
Lisp is the red pill. -- John Fraser, comp.lang.lisp
In article <··············@javamonkey.com>, Peter Seibel wrote:
>
> Yeah, but just wait a while and you'll probably be right again. ;-)
> Presumably the reason that only up to four-byte UTF-8 forms are legal
> is that that's all that's needed to encode the largest Unicode code
> point, 10ffff. But one of these days the Unicode folks will no doubt
> feel the need to use 32 bits' worth of code points, at which time,
> presumably, UTF-8 will have to be extended to five- and six-byte
> encodings.
>
> -Peter
>
Apparently not. The FAQ at unicode.org says:
Q: Will UTF-16 ever be extended to more than a million characters?
A: No. Both Unicode and ISO 10646 have policies in place that formally
limit future code assignment to the integer range that can be expressed
with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e.
other UTFs) can represent larger integers, these policies mean that all
encoding forms will always represent the same set of characters. Over a
million possible codes is far more than enough for the goal of Unicode of
encoding characters, not glyphs. Unicode is not designed to encode
arbitrary data. If you wanted, for example, to give each "instance of a
character on paper throughout history" its own code, you might need
trillions or quadrillions of such codes; noble as this effort might be,
you would not use Unicode for such an encoding.
How thoughtful of them...
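The upper bound of 1,114,111 in the FAQ answer falls out of how UTF-16 surrogate pairs work. A minimal sketch, assuming the standard surrogate ranges (#xD800-#xDBFF for the high half, #xDC00-#xDFFF for the low half); the function name `surrogates-to-code-point` is made up for illustration and the code does no range checking.

```lisp
;; Hypothetical helper for illustration only.
(defun surrogates-to-code-point (high low)
  "Combine a UTF-16 surrogate pair HIGH, LOW into a single code point."
  (+ #x10000
     (ash (- high #xD800) 10)   ; top 10 bits come from the high surrogate
     (- low #xDC00)))           ; bottom 10 bits come from the low surrogate
```

The largest possible pair, `(surrogates-to-code-point #xDBFF #xDFFF)`, gives #x10FFFF, which is 1114111 in decimal.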
--
Eric Daniel
Peter Seibel <·····@javamonkey.com> writes:
> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
>
> <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>
How about using ldb/dpb?
  (defun write-u2 (out value)
    ;; High-order byte first, then low-order byte.
    (write-byte (ldb (byte 8 8) value) out)
    (write-byte (ldb (byte 8 0) value) out))
seems cleaner to me than
  (defun write-u2 (out value)
    (write-byte (logand #xff (ash value -8)) out)
    (write-byte (logand #xff value) out))
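To see that the two styles agree, here is a small sketch that compares them without touching a stream, and uses dpb for the reverse direction. The function names are hypothetical and not from the chapter; the check only covers values that fit in 16 bits.

```lisp
;; Hypothetical helpers for illustration only.
(defun u2-octets-ldb (value)
  "High and low octets of VALUE, extracted with LDB."
  (list (ldb (byte 8 8) value) (ldb (byte 8 0) value)))

(defun u2-octets-logand (value)
  "High and low octets of VALUE, extracted with LOGAND and ASH."
  (list (logand #xff (ash value -8)) (logand #xff value)))

(defun octets-to-u2 (high low)
  "Reassemble a 16-bit value by depositing HIGH into bits 8-15 of LOW."
  (dpb high (byte 8 8) low))

(defun u2-styles-agree-p ()
  "True if both extraction styles match and round-trip for every 16-bit value."
  (loop for value from 0 below 65536
        always (and (equal (u2-octets-ldb value) (u2-octets-logand value))
                    (= value (apply #'octets-to-u2 (u2-octets-ldb value))))))
```

For example, `(u2-octets-ldb #xABCD)` yields the list (171 205), and `(octets-to-u2 #xAB #xCD)` gives back #xABCD.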
Andras