Practical Common Lisp takes apart binary files

From: Peter Seibel
Subject: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 15:26:35 +0000
Message-ID: <m3u0v1zt3x.fsf@javamonkey.com>

Chapter 19 in which we write a library for parsing binary files is now
on-line at:

  <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>

For those of you who are just joining us, this is a chapter from a
book about Common Lisp that I'm writing for Apress. The main page at

  <http://www.gigamonkeys.com/book/>

explains more. Basically I'm looking for feedback from any and all
about just about anything. I'm particularly interested in what bits
you found either confusing or helpful, especially if you are
relatively new to Lisp.

-Peter

-- 
Peter Seibel                                      ·····@javamonkey.com

         Lisp is the red pill. -- John Fraser, comp.lang.lisp

Re: Practical Common Lisp takes apart binary files Camm Maguire
- Re: Practical Common Lisp takes apart binary files Marco Antoniotti
Re: Practical Common Lisp takes apart binary files Michael Hudson
- Re: Practical Common Lisp takes apart binary files Peter Seibel
  - Re: Practical Common Lisp takes apart binary files Steven E. Harris
  - Re: Practical Common Lisp takes apart binary files Michael Hudson
    - Re: Practical Common Lisp takes apart binary files Peter Seibel
      - Re: Practical Common Lisp takes apart binary files Eric Daniel
  - Re: Practical Common Lisp takes apart binary files Stefan Scholl
Re: Practical Common Lisp takes apart binary files Andras Simon

From: Camm Maguire
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 17:06:49 +0000
Message-ID: <54y8kdu21i.fsf@intech19.enhanced.com>

Greetings, and thanks for this as always!

One thing I've always missed from C is a convenient Lisp binding to
mmap.  Anyone interested if I work up something for GCL?

Take care,

Peter Seibel <·····@javamonkey.com> writes:

> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
> 
>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
> 
> For those of you who are just joining us, this is a chapter from a
> book about Common Lisp that I'm writing for Apress. The main page at
> 
>   <http://www.gigamonkeys.com/book/>
> 
> explains more. Basically I'm looking for feedback from any and all
> about just about anything. I'm particularly interested in what bits
> you found either confusing or helpful, especially if you are
> relatively new to Lisp.
> 
> -Peter
> 
> -- 
> Peter Seibel                                      ·····@javamonkey.com
> 
>          Lisp is the red pill. -- John Fraser, comp.lang.lisp

-- 
Camm Maguire			     			····@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah

From: Marco Antoniotti
Subject: Re: Practical Common Lisp takes apart binary files
Date: Wed, 18 Aug 2004 14:23:46 +0000
Message-ID: <TDJUc.38$D5.13038@typhoon.nyu.edu>

Camm Maguire wrote:

> Greetings, and thanks for this as always!
> 
> One thing I've always missed from C is a convenient Lisp binding to
> mmap.  Anyone interested if I work up something for GCL?

I'd be more interested if you worked on making GCL pass more of the
ANSI test suite :)

... just keeping your priorities straight :)

Cheers
--
Marco

From: Michael Hudson
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 16:18:53 +0000
Message-ID: <m37jrxn3f6.fsf@pc150.maths.bris.ac.uk>

Peter Seibel <·····@javamonkey.com> writes:

> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
> 
>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>

A silly point.

Quoting:

    UTF-8 is a popular encoding for Unicode text that consists primarily
    of characters from the ASCII subset of Unicode since it encodes
    all such characters in a single byte, just as they would be if encoded
    in ASCII. However it can also encode any other Unicode character,
    using two, three, or even four bytes.

it goes up to six...

Cheers,
mwh

-- 
  ARTHUR:  Yes.  It was on display in the bottom of a locked filing
           cabinet stuck in a disused lavatory with a sign on the door
           saying "Beware of the Leopard".
                    -- The Hitch-Hikers Guide to the Galaxy, Episode 1

From: Peter Seibel
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 16:43:26 +0000
Message-ID: <m3d61pzpgh.fsf@javamonkey.com>

Michael Hudson <···@python.net> writes:

> Peter Seibel <·····@javamonkey.com> writes:
>
>> Chapter 19 in which we write a library for parsing binary files is now
>> on-line at:
>> 
>>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>
> A silly point.
>
> Quoting:
>
>     UTF-8 is a popular encoding for Unicode text that consists primarily
>     of characters from the ASCII subset of Unicode since it encodes
>     all such characters in a single byte, just as they would be if encoded
>     in ASCII. However it can also encode any other Unicode character,
>     using two, three, or even four bytes.
>
> it goes up to six...

Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of

  <http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>

only lists forms up to four bytes. And on the previous page it says,
"Any UTF-8 byte sequence that does not match the patterns listed in
Table 3.6 is ill-formed." But I'm not a Unicode expert so it's easily
possible I'm missing something.

-Peter

-- 
Peter Seibel                                      ·····@javamonkey.com

         Lisp is the red pill. -- John Fraser, comp.lang.lisp

From: Steven E. Harris
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 17:27:21 +0000
Message-ID: <jk4vffhd69y.fsf@W003275.na.alarismed.com>

Peter Seibel <·····@javamonkey.com> writes:

> But I'm not a Unicode expert so it's easily possible I'm missing
> something.

I too was surprised to see the Unicode document limit UTF-8 to four
bytes, so I did some research and came up with this explanation[1]:

  In fact, UTF-8 is able to use a sequence of up to six bytes and
  cover the whole area 0x00-0x7FFFFFFF (31 bits), but UTF-8 was
  restricted by RFC 3629 to only use the area covered by the formal
  Unicode definition, 0x00-0x10FFFF, in November 2003. Before this,
  only the bytes 0xFE and 0xFF did not occur in a UTF-8 encoded
  text. After this limit was introduced, the number of unused bytes in
  a UTF-8 stream increased to 13 bytes: 0xC0, 0xC1, 0xF5-0xFF. Even
  though this new definition limits the available encoding area
  severely, the problem with overlong sequences (different ways of
  encoding the same character, which can be a security risk) is
  eliminated, because an overlong sequence will contain some of these
  bytes that are not used and therefore will not be a valid sequence.

I had written two UTF-8 encoder/decoders around 1999 and recall the
six-byte layout. Now it's shorter but more complicated, with more
special cases. I would have never thought that something as simple as
UTF-8 would be revised without its name being changed.

Footnotes:
[1] http://www.wordiq.com/definition/UTF-8
    http://www.faqs.org/rfcs/rfc3629.html
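
The four-byte limit can be made concrete with a small encoder. What
follows is a minimal sketch, not code from the chapter: the name
UTF-8-OCTETS is hypothetical, and it assumes the caller wants only the
forms RFC 3629 allows, so it rejects surrogates and anything above
#x10FFFF rather than falling back to the old five- and six-byte layout.

```lisp
;; Sketch only: UTF-8-OCTETS is a hypothetical name, not part of any library.
(defun utf-8-octets (code-point)
  "Return a list of the octets that encode CODE-POINT in UTF-8 (RFC 3629 forms)."
  ;; Surrogates (#xD800-#xDFFF) and values above #x10FFFF are not encodable.
  (check-type code-point (or (integer 0 #xD7FF) (integer #xE000 #x10FFFF)))
  (flet ((continuation (position)
           ;; A continuation octet is #b10xxxxxx: six payload bits taken
           ;; from CODE-POINT starting at bit POSITION.
           (logior #x80 (ldb (byte 6 position) code-point))))
    (cond ((< code-point #x80)          ; one octet:    0xxxxxxx
           (list code-point))
          ((< code-point #x800)         ; two octets:   110xxxxx 10xxxxxx
           (list (logior #xC0 (ldb (byte 5 6) code-point))
                 (continuation 0)))
          ((< code-point #x10000)       ; three octets: 1110xxxx + 2 continuations
           (list (logior #xE0 (ldb (byte 4 12) code-point))
                 (continuation 6)
                 (continuation 0)))
          (t                            ; four octets:  11110xxx + 3 continuations
           (list (logior #xF0 (ldb (byte 3 18) code-point))
                 (continuation 12)
                 (continuation 6)
                 (continuation 0))))))
```

For example, (utf-8-octets #x20AC) returns (226 130 172), that is
#xE2 #x82 #xAC, and (utf-8-octets #x10FFFF) returns (244 143 191 191).
The lead octet of the largest code point is #xF4, which is why
#xF5-#xFF never appear in a stream restricted this way.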

-- 
Steven E. Harris

From: Michael Hudson
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 17:09:28 +0000
Message-ID: <m3y8kdlmif.fsf@pc150.maths.bris.ac.uk>

Peter Seibel <·····@javamonkey.com> writes:

> Michael Hudson <···@python.net> writes:
> 
> > Peter Seibel <·····@javamonkey.com> writes:
> >
> >> Chapter 19 in which we write a library for parsing binary files is now
> >> on-line at:
> >> 
> >>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
> >
> > A silly point.
> >
> > Quoting:
> >
> >     UTF-8 is a popular encoding for Unicode text that consists primarily
> > >     of characters from the ASCII subset of Unicode since it encodes
> >     all such characters in a single byte, just as they would be if encoded
> >     in ASCII. However it can also encode any other Unicode character,
> >     using two, three, or even four bytes.
> >
> > it goes up to six...
> 
> Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of
> 
>   <http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>
> 
> only lists forms up to four bytes. And on the previous page it says,
> "Any UTF-8 byte sequence that does not match the patterns listed in
> Table 3.6 is ill-formed." But I'm not a Unicode expert so it's easily
> possible I'm missing something.

Hmm.  RFC 2279, "UTF-8, a transformation format of ISO 10646" says:

    In UTF-8, characters are encoded using sequences of 1 to 6 octets.

but that's a considerably older document than the unicode 4 spec, so
maybe I'm out of date.

<reads>

Hmm, in http://www.unicode.org/versions/Unicode4.0.0/appC.pdf I see:


    The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
    allows for the use of five- and six-byte sequences to encode
    characters that are outside the range of the Unicode character
    set; those five- and six-byte sequences are illegal for the use of
    UTF-8 as an encoding form of Unicode characters.

So I guess you are right and I am wrong, though I'm not as wrong as I
could have been :-)

Cheers,
mwh

-- 
  Hmmm... its Sunday afternoon: I could do my work, or I could do a
  Fourier analysis of my computer's fan noise.
       -- Amit Muthu, ucam.chat (from Owen Dunn's summary of the year)

From: Peter Seibel
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 17:25:27 +0000
Message-ID: <m31xi5znlt.fsf@javamonkey.com>

Michael Hudson <···@python.net> writes:

> Peter Seibel <·····@javamonkey.com> writes:
>
>> Michael Hudson <···@python.net> writes:
>> 
>> > Peter Seibel <·····@javamonkey.com> writes:
>> >
>> >> Chapter 19 in which we write a library for parsing binary files is now
>> >> on-line at:
>> >> 
>> >>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
>> >
>> > A silly point.
>> >
>> > Quoting:
>> >
>> >     UTF-8 is a popular encoding for Unicode text that consists primarily
>> >     of characters from the ASCII subset of Unicode since it encodes
>> >     all such characters in a single byte, just as they would be if encoded
>> >     in ASCII. However it can also encode any other Unicode character,
>> >     using two, three, or even four bytes.
>> >
>> > it goes up to six...
>> 
>> Really? Table 3.6: Well-Formed UTF-8 Byte Sequences on page 78 of
>> 
>>   <http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf>
>> 
>> only lists forms up to four bytes. And on the previous page it says,
>> "Any UTF-8 byte sequence that does not match the patterns listed in
>> Table 3.6 is ill-formed." But I'm not a Unicode expert so it's easily
>> possible I'm missing something.
>
> Hmm.  RFC 2279, "UTF-8, a transformation format of ISO 10646" says:
>
>     In UTF-8, characters are encoded using sequences of 1 to 6 octets.
>
> but that's a considerably older document than the unicode 4 spec, so
> maybe I'm out of date.
>
> <reads>
>
> Hmm, in http://www.unicode.org/versions/Unicode4.0.0/appC.pdf I see:
>
>
>     The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
>     allows for the use of five- and six-byte sequences to encode
>     characters that are outside the range of the Unicode character
>     set; those five- and six-byte sequences are illegal for the use of
>     UTF-8 as an encoding form of Unicode characters.
>
> So I guess you are right and I am wrong, though I'm not as wrong as I
> could have been :-)

Yeah, but just wait a while and you'll probably be right again. ;-)
Presumably the reason that only up to four-byte UTF-8 forms are legal
is because that's all that's needed to code the largest Unicode code
point 10ffff. But one of these days the Unicode folks will no doubt
feel the need to use 32 bits' worth of code points, at which time,
presumably, UTF-8 will have to be extended to five and six byte
encodings.

-Peter

-- 
Peter Seibel                                      ·····@javamonkey.com

         Lisp is the red pill. -- John Fraser, comp.lang.lisp

From: Eric Daniel
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 18:44:30 +0000
Message-ID: <10i4kge870vcve5@corp.supernews.com>

In article <··············@javamonkey.com>, Peter Seibel wrote:
>  
>  Yeah, but just wait a while and you'll probably be right again. ;-)
>  Presumably the reason that only up to four-byte UTF-8 forms are legal
>  is because that's all that's needed to code the largest Unicode code
>  point 10ffff. But one of these days the Unicode folks will no doubt
>  feel the need to use 32 bits' worth of code points, at which time,
>  presumably, UTF-8 will have to be extended to five and six byte
>  encodings.
>  
>  -Peter
>  

Apparently not. The FAQ at unicode.org says:

  Q: Will UTF-16 ever be extended to more than a million characters?

  A: No. Both Unicode and ISO 10646 have policies in place that formally
  limit future code assignment to the integer range that can be expressed
  with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e.
  other UTFs) can represent larger integers, these policies mean that all
  encoding forms will always represent the same set of characters. Over a
  million possible codes is far more than enough for the goal of Unicode of
  encoding characters, not glyphs. Unicode is not designed to encode
  arbitrary data. If you wanted, for example, to give each "instance of a
  character on paper throughout history" its own code, you might need
  trillions or quadrillions of such codes; noble as this effort might be,
  you would not use Unicode for such an encoding.

How thoughtful of them...
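
The upper bound of 1,114,111 in that answer follows from how UTF-16
pairs surrogates. A rough check of the arithmetic, assuming the usual
layout of 1,024 high surrogates and 1,024 low surrogates on top of the
65,536 code points reachable with a single 16-bit unit (the function
name is made up for this example):

```lisp
;; Sketch only: UTF-16-CODE-POINT-COUNT is a hypothetical name.
(defun utf-16-code-point-count ()
  "Return how many code points UTF-16 can address."
  (let ((single-unit (expt 2 16))       ; values that fit in one 16-bit unit
        (surrogate-pairs (* 1024 1024))) ; high surrogate x low surrogate
    (+ single-unit surrogate-pairs)))

;; The largest code point is one less than the count.
(= (1- (utf-16-code-point-count)) #x10FFFF) ; => T
```

The count is 1,114,112, so the largest code point is 1,114,111, which
is #x10FFFF.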

-- 
Eric Daniel

From: Stefan Scholl
Subject: Re: Practical Common Lisp takes apart binary files
Date: Wed, 18 Aug 2004 12:04:56 +0000
Message-ID: <340ja88p30wn.dlg@parsec.no-spoon.de>

By the way: Markus Kuhn has written a nice man page for UTF-8. It's
installed with most recent Linux distributions:

==>	man utf-8

From: Andras Simon
Subject: Re: Practical Common Lisp takes apart binary files
Date: Tue, 17 Aug 2004 16:35:27 +0000
Message-ID: <vcdd61pbu40.fsf@csusza.math.bme.hu>

Peter Seibel <·····@javamonkey.com> writes:

> Chapter 19 in which we write a library for parsing binary files is now
> on-line at:
> 
>   <http://www.gigamonkeys.com/book/practical-parsing-binary-files.html>
> 

How about using ldb/dpb?

(defun write-u2 (out value)
  (write-byte (ldb (byte 8 8) value) out)
  (write-byte (ldb (byte 8 0) value) out))

seems cleaner to me than 

(defun write-u2 (out value)
  (write-byte (logand #xff (ash value -8)) out)
  (write-byte (logand #xff value) out))
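
Both styles pull out the same two bytes, and DPB goes the other way.
Here is a quick check that avoids needing a binary stream; the helper
names SPLIT-U2-LDB, SPLIT-U2-ASH and JOIN-U2 are invented for this
sketch and are not from the chapter:

```lisp
;; Sketch only: these helper names are hypothetical.
(defun split-u2-ldb (value)
  "Return the high and low bytes of the 16-bit VALUE using LDB."
  (list (ldb (byte 8 8) value)
        (ldb (byte 8 0) value)))

(defun split-u2-ash (value)
  "Return the same two bytes using LOGAND and ASH."
  (list (logand #xff (ash value -8))
        (logand #xff value)))

(defun join-u2 (high low)
  "Rebuild the 16-bit integer by depositing HIGH into bits 8-15 of LOW."
  (dpb high (byte 8 8) low))
```

(split-u2-ldb #xABCD) and (split-u2-ash #xABCD) both return (171 205),
that is #xAB and #xCD, and (join-u2 #xAB #xCD) returns 43981, which is
#xABCD again.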

Andras