How to convert byte -> char

From: Dave
Subject: How to convert byte -> char
Date: Sat, 08 Mar 2003 00:50:48 +0000
Message-ID: <jy2dnbiZ_M1yofSjXTWcqg@comcast.com>

I think this is a silly question, and I must be missing something 
obvious, but I have been unable to find the answer. All I can find in 
CLTL2 is mention that int-char has been removed. All my attempts at 
coercion have failed (with CMUCL). I have an array of unsigned bytes, 
how can I convert them to characters and store them in a string?

Any help would be appreciated!
Dave.

Re: How to convert byte -> char Thomas F. Burdick
- Re: How to convert byte -> char Dave
- Re: How to convert byte -> char Harald Hanche-Olsen
  - Re: How to convert byte -> char Kenny Tilton
  - Re: How to convert byte -> char Thomas F. Burdick
  - Re: How to convert byte -> char Joerg Hoehle

From: Thomas F. Burdick
Subject: Re: How to convert byte -> char
Date: Sat, 08 Mar 2003 00:50:42 +0000
Message-ID: <xcvr89ivfz1.fsf@fallingrocks.OCF.Berkeley.EDU>

Dave <··········@comcast.net> writes:

> I think this is a silly question, and I must be missing something 
> obvious, but I have been unable to find the answer. All I can find in 
> CLTL2 is mention that int-char has been removed. All my attempts at 
> coercion have failed (with CMUCL). I have an array of unsigned bytes, 
> how can I convert them to characters and store them in a string?
> 
> Any help would be appreciated!

This is a great example of how CLtL2 can be confusing, because it
concentrates on the differences from CLtL1.  The HyperSpec is much
nicer in this regard, and is actually the spec, as opposed to being
based on an early draft.

The functions you want are:

  CODE-CHAR: converts from code -> char
and
  CHAR-CODE: converts from char -> code
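
As a minimal sketch of how these two functions apply to the original
question: the names *BYTES*, BYTES-TO-STRING and STRING-TO-BYTES below
are made up for illustration, and the example assumes codes below 128
map to ASCII, which holds in the common implementations but is not
promised by the standard.

```lisp
;; *BYTES* is a hypothetical example vector; the values are ASCII codes.
(defvar *bytes*
  (make-array 5 :element-type '(unsigned-byte 8)
                :initial-contents '(72 101 108 108 111)))

;; MAP calls CODE-CHAR on each byte and collects the results in a
;; freshly allocated string.
(defun bytes-to-string (bytes)
  (map 'string #'code-char bytes))

;; The reverse direction, using CHAR-CODE.  This only works for
;; characters whose code fits in 8 bits.
(defun string-to-bytes (string)
  (map '(vector (unsigned-byte 8)) #'char-code string))
```

With the ASCII assumption, (bytes-to-string *bytes*) gives "Hello".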

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Dave
Subject: Re: How to convert byte -> char
Date: Sat, 08 Mar 2003 01:32:49 +0000
Message-ID: <5SOdnQ_LBbtV2_SjXTWcpg@comcast.com>

Excellent, thanks! I'm just getting started with lisp, so I usually find 
the narrative and examples from CLtL2 to be a bit more helpful than the 
terse technical language of the HyperSpec. I'm keeping a bookmark folder 
full of links to docs and tutorials I've found useful around the net, 
but I'm glad to have folks like you on c.l.lisp when I get stuck!

Dave.


From: Harald Hanche-Olsen
Subject: Re: How to convert byte -> char
Date: Sat, 08 Mar 2003 22:19:58 +0000
Message-ID: <pco65qtlcvl.fsf@thoth.math.ntnu.no>

+ ···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick):

| Dave <··········@comcast.net> writes:
| 
| > I have an array of unsigned bytes, how can I convert them to
| > characters and store them in a string?
| 
| The functions you want are:
| 
|   CODE-CHAR: converts from code -> char
| and
|   CHAR-CODE: converts from char -> code

Indeed.  However, sometimes I'd like to do better.  In order to have a
specific example to talk about, assume I am looking at a PDF file.
Now a typical PDF file contains a mixture of binary and textual data.
If I wish to parse such a beast, one strategy might be to suck the
entire file into a big array of bytes and then play around with the
array.  Every time I find a piece of text and want to treat it as
such, I then wind up copying it into a string using

(map 'string #'code-char (subseq buffer start end))

except that will make /two/ copies, so I might try

(map 'string #'code-char
     (make-array length :element-type '(unsigned-byte 8)
                 :displaced-to buffer :displaced-index-offset start))

which will make only one copy.  Still, this /feels/ inefficient since
the conversion involved, when viewed at the bit level, really is the
identity map.

Is there a way to do this more efficiently?  Or should I not worry?
That I do (on occasion) worry about such things is perhaps just a
relic of my C days.
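
For comparison, one way to get the single copy without a displaced
array might be an explicit loop over the index range.  This is only a
sketch: BYTE-RANGE-TO-STRING is a hypothetical helper name, and BUFFER
is assumed to be a vector of (unsigned-byte 8).

```lisp
;; BYTE-RANGE-TO-STRING is a hypothetical helper, not a standard function.
;; It copies the bytes of BUFFER from START (inclusive) to END
;; (exclusive) into a fresh string, touching each byte once.
(defun byte-range-to-string (buffer start end)
  (let ((result (make-string (- end start))))
    (loop for i from start below end
          for j from 0
          do (setf (char result j) (code-char (aref buffer i))))
    result))
```

Whether this beats the displaced-array version is something to measure
in a given implementation rather than assume.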

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- Yes it works in practice - but does it work in theory?

From: Kenny Tilton
Subject: Re: How to convert byte -> char
Date: Sun, 09 Mar 2003 00:24:43 +0000
Message-ID: <3E6A8B8F.20805@nyc.rr.com>

Harald Hanche-Olsen wrote:
> (map 'string #'code-char
>      (make-array length :element-type '(unsigned-byte 8)
>                  :displaced-to buffer :displaced-index-offset start))
> 
> which will make only one copy.  Still, this /feels/ inefficient since
> the conversion involved, when viewed at the bit level, really is the
> identity map.
> 
> Is there a way to do this more efficiently?  Or should I not worry?
> That I do (on occasion) worry about such things is perhaps just a
> relic of my C days.

What would you do in "C"? Allocate a structure which says "there is text 
which starts here (and runs this long if PDF text is not null 
terminated)"? Or would you just put the address/length in a list of 
found text chunks? You can do the same in Lisp if you do not want to 
copy text out of the PDF array into standalone strings. And that seems 
reasonable if you'd be copying large wadges of text to get a standalone 
string. So instead create a parallel structure of data structures which 
point back into the PDF array, then manipulate those.

We all know not to optimize prematurely, but this does not feel like 
that, it feels more like not duplicating needlessly, using a minimum of 
information to capture the result of parsing the PDF.
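
A rough sketch of what such a parallel structure might look like
follows.  TEXT-CHUNK, CHUNK-LENGTH, CHUNK-CHAR and CHUNK-STRING are all
hypothetical names for illustration, and the buffer is assumed to be a
vector of (unsigned-byte 8) holding the file contents.

```lisp
;; A chunk records where a piece of text lives in the shared byte
;; buffer, without copying it.  All names here are hypothetical.
(defstruct text-chunk
  buffer   ; the (unsigned-byte 8) vector holding the whole file
  start    ; index of the first byte of the text
  end)     ; index one past the last byte of the text

(defun chunk-length (chunk)
  (- (text-chunk-end chunk) (text-chunk-start chunk)))

;; Read one character through the chunk, still without copying.
(defun chunk-char (chunk index)
  (code-char (aref (text-chunk-buffer chunk)
                   (+ (text-chunk-start chunk) index))))

;; Copy the text out only when a standalone string is really needed.
(defun chunk-string (chunk)
  (let ((result (make-string (chunk-length chunk))))
    (dotimes (i (chunk-length chunk) result)
      (setf (char result i) (chunk-char chunk i)))))
```

The parser would then collect TEXT-CHUNK objects as it goes and call
CHUNK-STRING only for the pieces that end up being used as strings.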

-- 

  kenny tilton
  clinisys, inc
  http://www.tilton-technology.com/
  ---------------------------------------------------------------
"Cells let us walk, talk, think, make love and realize
  the bath water is cold." -- Lorraine Lee Cudmore

From: Thomas F. Burdick
Subject: Re: How to convert byte -> char
Date: Sun, 09 Mar 2003 19:43:57 +0000
Message-ID: <xcv4r6c1g1u.fsf@apocalypse.OCF.Berkeley.EDU>

Harald Hanche-Olsen <······@math.ntnu.no> writes:

> + ···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick):
> 
> | Dave <··········@comcast.net> writes:
> | 
> | > I have an array of unsigned bytes, how can I convert them to
> | > characters and store them in a string?
> | 
> | The functions you want are:
> | 
> |   CODE-CHAR: converts from code -> char
> | and
> |   CHAR-CODE: converts from char -> code
> 
> Indeed.  However, sometimes I'd like to do better.  In order to have a
> specific example to talk about, assume I am looking at a PDF file.
> Now a typical PDF file contains a mixture of binary and textual data.
> If I wish to parse such a beast, one strategy might be to suck the
> entire file into a big array of bytes and then play around with the
> array.  Every time I find a piece of text and want to treat it as
> such, I then wind up copying it into a string using
> 
> (map 'string #'code-char (subseq buffer start end))
> 
> except that will make /two/ copies, so I might try

Yuck

> (map 'string #'code-char
>      (make-array length :element-type '(unsigned-byte 8)
>                  :displaced-to buffer :displaced-index-offset start))

Much nicer.

> which will make only one copy.  Still, this /feels/ inefficient since
> the conversion involved, when viewed at the bit level, really is the
> identity map.

You don't operate directly on bits in CL, but even on the bit level,
it's still not quite an identity operation.  But...

> Is there a way to do this more efficiently?  Or should I not worry?
> That I do (on occasion) worry about such things is perhaps just a
> relic of my C days.

Well, you really shouldn't be slurping a PDF file into memory if you
don't have to (IMO), as it kind of defeats the point of the format.
The way I do this is to have a stream of the PDF, and utility
functions to read objects from it.  If you're using bivalent streams
(where you can change the stream's element-type), this is easier.  If
not, just use an (unsigned-byte 8) stream, and write your utility
functions to read from that.  For example, something like:

  (defun pdf-read-string (stream &key (length 0) (index 0))
    ;; STREAM is an (unsigned-byte 8) stream; seek first, then read
    ;; LENGTH bytes and convert each one to a character.
    (file-position stream index)
    (loop with result = (make-string length)
          for i from 0 below length
          do (setf (char result i) (code-char (read-byte stream)))
          finally (return result)))
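
A possible variant, sketched here under assumptions, reads the bytes in
one READ-SEQUENCE call and converts them afterwards.
PDF-READ-STRING-BULK is a hypothetical name, and STREAM is assumed to
be an (unsigned-byte 8) input stream that supports FILE-POSITION.

```lisp
;; PDF-READ-STRING-BULK is a hypothetical name, not part of any library.
(defun pdf-read-string-bulk (stream &key (length 0) (index 0))
  (let ((bytes (make-array length :element-type '(unsigned-byte 8))))
    (file-position stream index)
    ;; READ-SEQUENCE returns the index of the first element it did not
    ;; fill, which is less than LENGTH if the file ends early.
    (let ((filled (read-sequence bytes stream)))
      (map 'string #'code-char (subseq bytes 0 filled)))))
```

This trades the per-byte READ-BYTE calls for an intermediate byte
vector, so it makes one more copy; which is faster will depend on the
implementation.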

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Joerg Hoehle
Subject: Re: How to convert byte -> char
Date: Thu, 13 Mar 2003 15:36:29 +0000
Message-ID: <uvfynth1e.fsf@dont.t-systems.UCE.spam.no.com>

Harald Hanche-Olsen <······@math.ntnu.no> writes:
> Now a typical PDF file contains a mixture of binary and textual data.
[...]
> Still, this /feels/ inefficient since
> the conversion involved, when viewed at the bit level, really is the
> identity map.

This need not be so. For example, CLISP uses some form of
UNICODE internally (actually it uses varying representations). So
strings may well be 2 or 4 bytes wide per character, and the
representation in memory will differ from 8-bit-bytes.
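
A small sketch of how one might check this from portable code, rather
than assume a byte-for-byte mapping: BYTE-CHAR and
CHARACTERS-WIDER-THAN-BYTES-P are hypothetical helper names, while
CODE-CHAR and CHAR-CODE-LIMIT are standard.

```lisp
;; BYTE-CHAR is a hypothetical helper name.
;; CODE-CHAR may return NIL when the implementation has no character
;; with the given code, so check instead of assuming one exists.
(defun byte-char (byte)
  (or (code-char byte)
      (error "No character with code ~D in this implementation." byte)))

;; CHAR-CODE-LIMIT is the standard exclusive upper bound on character
;; codes.  A value above 256 means characters cannot all fit in one
;; byte, so a string is not simply a byte vector in disguise.
(defun characters-wider-than-bytes-p ()
  (> char-code-limit 256))
```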

> Is there a way to do this more efficiently?  Or should I not worry?
> That I do (on occasion) worry about such things is perhaps just a
> relic of my C days.

It's more a relic of your 1-byte-a-character heritage. You're not
Asian, are you? (not with your name) :-) They have been using various
multibyte encodings for years.

Recently, I read a 50MB file using READ-SEQUENCE into an array of
(UNSIGNED-BYTE 8) with length (ash 2 17) (-> ~256KB). It took 2
seconds. Using a string of same length, (ash 2 17), it took 4
seconds. I concluded that unicode handling in CLISP is fast enough for
my application's needs -- the time for reading is negligible compared
with the rest of the application's execution time.
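
The byte-array half of such a measurement could be sketched roughly as
below.  This is not the code that produced the figures above:
COUNT-BYTES-IN-CHUNKS is a hypothetical name, and only the default
chunk size is taken from the text.

```lisp
;; COUNT-BYTES-IN-CHUNKS is a hypothetical name.  It reads the file at
;; PATH through one reusable buffer and returns the total byte count.
(defun count-bytes-in-chunks (path &optional (chunk-size (ash 2 17)))
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((buffer (make-array chunk-size :element-type '(unsigned-byte 8))))
      ;; READ-SEQUENCE returns how many elements of BUFFER it filled;
      ;; zero means end of file.
      (loop for filled = (read-sequence buffer in)
            until (zerop filled)
            sum filled))))
```

Wrapping a call in TIME, and doing the same with a character stream and
a string buffer, gives a comparison of the kind described here.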

Regards,
	Jörg Höhle
Telekom/T-Systems Technology Center