From: Randall Randall
Subject: Unicode-handling library
Date: 
Message-ID: <43619482.0408241706.5354c5f9@posting.google.com>
I've started on a small library that simplifies Unicode handling.
It's currently intended to be fully portable Common Lisp, and
the functions it defines should conform to the CLHS's definitions.

You can find it at 
http://www.randallsquared.com/download/unicode-0.99rc1.lisp .

In order to try it out, you'll need to get
http://www.randallsquared.com/download/tables.lisp 
and data from the Unicode consortium, at 
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt .

This has essentially everything I had in mind for 1.0, to wit:
import and export of UTF-8, UTF-16*, UTF-32*, us-ascii, ISO 8859-[1-16];
most string and character functions implemented;
the most basic 15100 characters (if UnicodeData.txt is supplied).

Things it doesn't have yet, but which are planned for after 1.0:
conversions: SCSU, &-escaped ASCII, CMUCL characters,
  OpenMCL characters, etc.;
inclusion of other Unicode characters;
reworking at least some errors to be cerrors;
convenience helpers for reading and writing files and other streams;
handling one-to-many mappings of case for *ansi-compliant* ==> NIL;
maybe more printer methods, though I'm not very familiar with those.

A trivial example (trivial because this file contains no multi-byte
UTF-8 sequences):
* (with-open-file (f "/Users/randall/unicode-test.txt" 
                     :element-type '(unsigned-byte 8))
    (let ((utf8 (make-array 13)))   
      (read-sequence utf8 f)
      (utf-8->internal utf8)))
; =>
#(#\U+0054 #\U+0068 #\U+0069 #\U+0073 #\U+0020 #\U+0069 #\U+0073 
  #\U+0020 #\U+0061 #\U+0020 #\U+0075 #\U+006E #\U+0069)
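
The example above contains only single-octet sequences. As a rough
illustration of what decoding involves once multi-octet sequences
appear, here is a minimal sketch in portable Common Lisp. DECODE-UTF-8
is a hypothetical name, not a function from the library, and it returns
integer code points rather than the library's internal character
objects. It assumes well-formed input apart from the checks shown; it
does not reject overlong forms or surrogate code points.

```lisp
;; Sketch only: DECODE-UTF-8 is a hypothetical helper, not part of the
;; library discussed in this thread.
(defun decode-utf-8 (octets)
  "Return a list of integer code points decoded from OCTETS, a vector
of (unsigned-byte 8) values.  Signals an error on a bad lead octet, a
bad continuation octet, or a truncated sequence."
  (let ((result '())
        (i 0)
        (n (length octets)))
    (flet ((continuation (index)
             ;; Return the low six bits of the continuation octet at INDEX.
             (when (>= index n)
               (error "Truncated UTF-8 sequence."))
             (let ((b (aref octets index)))
               (unless (= (logand b #xC0) #x80)
                 (error "Invalid continuation octet #x~2,'0X at index ~D."
                        b index))
               (logand b #x3F))))
      (loop while (< i n)
            do (let ((b (aref octets i)))
                 ;; The lead octet gives the payload bits and the number
                 ;; of continuation octets that follow.
                 (multiple-value-bind (code extra)
                     (cond ((< b #x80) (values b 0))
                           ((= (logand b #xE0) #xC0) (values (logand b #x1F) 1))
                           ((= (logand b #xF0) #xE0) (values (logand b #x0F) 2))
                           ((= (logand b #xF8) #xF0) (values (logand b #x07) 3))
                           (t (error "Invalid lead octet #x~2,'0X at index ~D."
                                     b i)))
                   ;; Each continuation octet contributes six more bits.
                   (dotimes (k extra)
                     (setf code (+ (ash code 6) (continuation (+ i k 1)))))
                   (push code result)
                   (incf i (1+ extra))))))
    (nreverse result)))
```

For instance, (decode-utf-8 #(#x54 #xC3 #xA9 #xE2 #x82 #xAC)) returns
(84 233 8364), that is U+0054, U+00E9 and U+20AC. Returning integers
sidesteps the fact that a portable program cannot rely on CODE-CHAR
accepting every Unicode code point.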

If this is useful for anyone, I'd appreciate bug reports and feature
requests!

--
Randall Randall <·······@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .

From: Arthur Lemmens
Subject: Re: Unicode-handling library
Date: 
Message-ID: <opsc92s801k6vmsw@news.xs4all.nl>
Hi Randall,

> I've started on a small library that simplifies unicode handling.

Looks nice.

I saw the following remark in your source:

> Takes a long time to start up, since it needs to read in all the unicode
> data.  This seems unlikely to improve with further code point databases,

I think you could solve that by reading in the unicode data at compile
time. To give you an idea of how this could be done, here are a few
snippets of my own library for dealing with character encodings:

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun load-unicode-table (filename)
    ;; Returns a mapping vector corresponding to the information in the
    ;; given mapping file from the Unicode consortium.
    ...))

(define-simple-character-encoding
 :iso-8859-2
 :vector #.(load-unicode-table "8859-2"))
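
To make the idea concrete, here is a small self-contained sketch of the
same technique. It is not the code from either library: the names
PARSE-MAPPING-TEXT and *SAMPLE-8859-2-TABLE* are hypothetical, and the
table is parsed from an inline string (three entries in the format of
the consortium's 8859-2 mapping file) instead of a file, so that the
example runs on its own. The EVAL-WHEN makes the parser available while
the file is being compiled, and #. runs it at read time, so the
resulting vector is stored in the compiled file as a literal.

```lisp
;; Sketch only: hypothetical names, inline sample data.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun parse-mapping-text (text)
    "Parse lines of the form \"0xA1 0x0104 # comment\" into a 256-element
vector that maps an octet to a code point (NIL where no entry is given).
Lines that do not start with 0x, such as comments, are ignored."
    (let ((table (make-array 256 :initial-element nil)))
      (with-input-from-string (in text)
        (loop for line = (read-line in nil nil)
              while line
              do (let ((entry (string-left-trim '(#\Space #\Tab) line)))
                   (when (and (> (length entry) 2)
                              (string= "0x" entry :end2 2))
                     ;; First hex field: the octet in the 8-bit encoding.
                     (multiple-value-bind (octet pos)
                         (parse-integer entry :start 2 :radix 16
                                              :junk-allowed t)
                       ;; Second hex field: the Unicode code point.
                       (let* ((hex (search "0x" entry :start2 pos))
                              (code (and hex
                                         (parse-integer entry
                                                        :start (+ hex 2)
                                                        :radix 16
                                                        :junk-allowed t))))
                         (when (and octet code (< octet 256))
                           (setf (aref table octet) code))))))))
      table)))

;; The parser runs when this form is read, not when the compiled file
;; is loaded.
(defparameter *sample-8859-2-table*
  #.(parse-mapping-text "# sample entries only
0xA1 0x0104 # LATIN CAPITAL LETTER A WITH OGONEK
0xA2 0x02D8 # BREVE
0xA3 0x0141 # LATIN CAPITAL LETTER L WITH STROKE"))
```

After loading, (aref *sample-8859-2-table* #xA1) gives 260, i.e. U+0104.
A file-based version would read the same lines with WITH-OPEN-FILE; note
that #. depends on *READ-EVAL* being true, which is the default.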


Arthur
From: Randall Randall
Subject: Re: Unicode-handling library
Date: 
Message-ID: <43619482.0408251415.5eba5526@posting.google.com>
Arthur Lemmens <········@xs4all.nl> wrote in message news:<················@news.xs4all.nl>...
> Hi Randall,
> Looks nice.

Thanks!

> I saw the following remark in your source:
> 
> > Takes a long time to start up, since it needs to read in all the unicode
> > data.  This seems unlikely to improve with further code point databases,
> 
> I think you could solve that by reading in the unicode data at compile
> time. To give you an idea of how this could be done, here are a few
> snippets of my own library for dealing with character encodings:
> 
> (eval-when (:compile-toplevel :load-toplevel :execute)
[snip]

Adding (eval-when ...) does indeed mostly solve that problem,
reducing the load time on my 1 GHz G4 from ~50 seconds to ~2.

After separating the load, compilation, and package stuff into
package.lisp, there's a new version (also with some minor bugfixes):
http://www.randallsquared.com/download/unicode-0.99rc2.tar.gz

Thanks for the advice!

--
Randall Randall <·······@randallsquared.com>
Property law should use #'EQ , not #'EQUAL .
From: Klaus Harbo
Subject: Re: Unicode-handling library
Date: 
Message-ID: <87d614alrs.fsf@harbo.net>
I'm wondering if anyone could recommend a good, comprehensive book about Unicode?

-K.
From: Kalle Olavi Niemitalo
Subject: Re: Unicode-handling library
Date: 
Message-ID: <87llfscy4u.fsf@Astalo.kon.iki.fi>
Klaus Harbo <·····@harbo.net> writes:

> I'm wondering if anyone could recommend a good, comprehensive
> book about Unicode?

I bought The Unicode Standard Version 3.0 on paper, and it was a
mistake.  Most of the book is filled with code charts, which make
it unwieldy and are easier to use electronically.  The obscure
symbols have provided some laughs, though.
From: Adam Warner
Subject: Re: Unicode-handling library
Date: 
Message-ID: <pan.2004.09.03.10.45.01.947217@consulting.net.nz>
Hi Klaus Harbo,

> I'm wondering if anyone could recommend a good, comprehensive book about
> Unicode?

The whole book is now available online:
<http://www.unicode.org/versions/Unicode4.0.1/>

Regards,
Adam