How to handle non-ASCII characters in SBCL's Lisp reader?

From: Emre Sevinc
Subject: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Thu, 21 Jul 2005 17:34:58 +0000
Message-ID: <87vf34xdtp.fsf@ileriseviye.org>

As some of you may know, I'm building a small linguistic
utility. Things have been fine so far, at least for English.
However, with Turkish characters (Turkish is my native language) I
have run into problems.

One of the core parts of my program is as follows:


;;
;; The code to convert a bracketed list (from file)
;; into a Lisp list, suggested by Kent M. Pitman
;;
(defvar *parser-readtable* (copy-readtable))

(defun read-bracketed-list (stream char)
  (declare (ignore char))
  (read-delimited-list #\] stream t))

;; #\K or any other constituent character will do
(set-syntax-from-char #\' #\K *parser-readtable*)

(set-syntax-from-char #\] #\) *parser-readtable*)

(set-macro-character #\[ 'read-bracketed-list nil *parser-readtable*)

(defun parse-text (text)
  (let ((*readtable* *parser-readtable*))
    (setf (readtable-case *readtable*) :preserve) 
    (with-input-from-string (s (substitute #\* #\' text))
      (read s))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(defun read-linear-file (file-name)
  "Reads file at path FILE-NAME until finding a [,
then parses it to a list of tokens and sublists."
  (let ((line (with-open-file (stream file-name)
                (peek-char #\[ stream)
                (read-line stream))))
    (parse-text line)))

This program is fine with English input such as:

[PP [Spec right] [P [P across] [NP [Spec the] [N [N ergin]]]]]

happily producing output like this:

LINEAR2TREE> (read-linear-file "/home/fz/programming/Lisp/graphviz/grammar-input.txt")
(PP (|Spec| |right|) (P (P |across|) (NP (|Spec| |the|) (N (N |ergin|)))))

However, when I feed it a file containing Turkish characters
(the input file is otherwise correctly formatted), I get:

LINEAR2TREE> (read-linear-file "/home/fz/programming/Lisp/graphviz/grammar-input-turkish.txt")
; Evaluation aborted

READER-ERROR at 15 (line 1, column 15) on #<SB-IMPL::STRING-INPUT-STREAM {92ACBE1}>:
undefined read-macro character #\�
   [Condition of type READER-ERROR]

The Turkish characters I'm talking about are these
(written here as HTML entities):


&#286; Latin Capital Letter G With Breve (Capital G Breve)
&#287; Latin Small Letter G With Breve   (Small G Breve)
&#304; Latin Capital Letter I With Dot Above   (Capital I Dot)
&#305; Latin Small Letter Dotless I
&#350; Latin Capital Letter S With Cedilla     (Capital S Cedilla)
&#351; Latin Small Letter S With Cedilla       (Small S Cedilla)
&ccedil; Small c with cedilla
&Ccedil; Capital C with cedilla
&Ouml;   Capital O with dieresis
&ouml;   Small o with  dieresis
&Uuml;   Capital U with dieresis 
&uuml;   Small u with dieresis


How can I make my SBCL read that file without complaining
about some undefined read-macro characters? (I hope I can
do it with SBCL, even though I remember (vaguely) SBCL
was a little bit problematic for non-ASCII characters.)

Thanks in advance.

Happy hacking

-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http:www.bilgi.edu.tr           http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr


From: Emre Sevinc
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Thu, 21 Jul 2005 23:35:29 +0000
Message-ID: <87r7drybpa.fsf@ileriseviye.org>

Emre Sevinc <·····@bilgi.edu.tr> writes:

> READER-ERROR at 15 (line 1, column 15) on #<SB-IMPL::STRING-INPUT-STREAM {92ACBE1}>:
> undefined read-macro character #\�
>    [Condition of type READER-ERROR]
>
> How can I make my SBCL read that file without complaining
> about some undefined read-macro characters? (I hope I can
> do it with SBCL, even though I remember (vaguely) SBCL
> was a little bit problematic for non-ASCII characters.)


It seems that SBCL currently doesn't support files containing
Turkish characters encoded in ISO-8859-9.

I also tried something like:

 (with-open-file (stream file-name :external-format :latin-1)
  .
  .
  .

which didn't work (neither with :latin-1 nor with :latin-5).
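
A minimal sketch of opening a file with an explicit external format,
assuming a Unicode-enabled Lisp that accepts :utf-8. Whether :latin-5 or
:iso-8859-9 is accepted depends on the SBCL version, as this thread shows.
READ-FIRST-LINE and WRITE-SAMPLE-FILE are hypothetical helper names, not
part of the program above.

```lisp
;; Sketch only: both helpers are hypothetical, not from the original post.
;; Assumes the running Lisp accepts the :UTF-8 external format keyword.
(defun read-first-line (file-name &key (external-format :utf-8))
  "Return the first line of FILE-NAME, decoded with EXTERNAL-FORMAT."
  (with-open-file (stream file-name
                          :direction :input
                          :external-format external-format)
    (read-line stream)))

(defun write-sample-file (file-name text &key (external-format :utf-8))
  "Write TEXT plus a newline to FILE-NAME, encoded with EXTERNAL-FORMAT."
  (with-open-file (stream file-name
                          :direction :output
                          :if-exists :supersede
                          :external-format external-format)
    (write-line text stream))
  file-name)
```

Writing a sample line with WRITE-SAMPLE-FILE and reading it back with
READ-FIRST-LINE under the same external format should return the same
string; if it does not, the Lisp image is probably not decoding the file
the way the file was encoded.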
 

A quick chat with people in #lisp told me that 
I need a very recent version of SBCL and even that
may not handle the case. 

Somebody said that Ivan Boldyrev claims to have posted (on sbcl-devel)
a patch that adds handling for encodings up to ISO-8859-14,
but I don't know much about it.

Maybe I should give GNU CLISP a try?

It is a pity being forced to change Lisp implementation 
for an encoding issue :(

Xach suggested I hack src/code/fd-stream.lisp
(so that I can be famous in addition to solving my problem :)
which I don't consider a very strong alternative now.

I was also told about these repositories:

deb http://people.debian.org/~pvaneynd/cl-sarge-packages ./
deb http://people.debian.org/~pvaneynd/cl-packages ./

But I'm not sure whether the latest SBCL can solve my problem
(the official version from the Debian repository didn't even
install successfully :(


-- 
Emre Sevinc


From: Thomas F. Burdick
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Fri, 22 Jul 2005 06:11:06 +0000
Message-ID: <xcvll3zpdz9.fsf@conquest.OCF.Berkeley.EDU>

Emre Sevinc <·····@bilgi.edu.tr> writes:

> Somebody said that Ivan Boldyrev claims to have posted (on sbcl-devel)
> a patch that adds handling for encodings up to ISO-8859-14,
> but I don't know much about it.

That would probably be your best bet -- that or using utf8 instead of
latin-5, if you can.

> Maybe I should give GNU CLISP a try?
> 
> It is a pity being forced to change Lisp implementation 
> for an encoding issue :(

I agree -- there may be reasons to switch between SBCL and CLISP, but
character support should no longer be one of them.  All the difficult
work has been done for adding Unicode to SBCL; all that remains is
creating tables for the various mappings between 8-bit bytes and
unicode characters.

> Xach suggested I hack src/code/fd-stream.lisp
> (so that I can be famous in addition to solving my problem :)
> which I don't consider a very strong alternative now.

Why not?  The hardest part is making a table that describes the
encoding: a string whose zeroth, first, second, ... nth ... 255th
elements are the appropriate characters for the bytes 0, 1, ... n ... 255.

I assume that would be fairly easy to make for someone who actually
uses latin-5.  If you posted the code to create such a table to
sbcl-devel, I'm quite certain you would quickly receive help with
where to proceed from there.
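
A minimal sketch of such a table for latin-5 (ISO-8859-9), assuming a
Unicode-enabled Lisp where CODE-CHAR accepts code points above 255.
ISO-8859-9 matches latin-1 except at six byte values, so the table starts
as the identity mapping and then overrides those six. The names
*LATIN-5-OVERRIDES*, MAKE-LATIN-5-TABLE and DECODE-LATIN-5 are
hypothetical; this is not the format SBCL's own sources use.

```lisp
;; Sketch only: all names here are hypothetical illustrations.
;; Byte values where ISO-8859-9 differs from ISO-8859-1, paired with
;; the Unicode code point each byte stands for.
(defparameter *latin-5-overrides*
  '((#xD0 . #x011E)    ; capital G with breve
    (#xDD . #x0130)    ; capital I with dot above
    (#xDE . #x015E)    ; capital S with cedilla
    (#xF0 . #x011F)    ; small g with breve
    (#xFD . #x0131)    ; small dotless i
    (#xFE . #x015F)))  ; small s with cedilla

(defun make-latin-5-table ()
  "Return a 256-character string whose Nth element decodes byte N."
  (let ((table (make-string 256)))
    ;; Start from latin-1, where byte N is simply code point N.
    (dotimes (byte 256)
      (setf (char table byte) (code-char byte)))
    ;; Then patch the six positions that differ in latin-5.
    (loop for (byte . code) in *latin-5-overrides*
          do (setf (char table byte) (code-char code)))
    table))

(defun decode-latin-5 (octets &optional (table (make-latin-5-table)))
  "Decode a sequence of byte values OCTETS into a string using TABLE."
  (map 'string (lambda (byte) (char table byte)) octets))
```

For example, (decode-latin-5 #(#x67 #xFD)) yields a two-character string:
a plain "g" followed by the dotless i.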

> But I'm not sure whether the latest SBCL can solve my problem
> (the official version from the Debian repository didn't even
> install successfully :(

No, the latest SBCL supports only utf8, latin-1, ascii, and ebcdic
encodings.  I recommend compiling SBCL yourself at least for the
moment; if you have another SBCL or CMUCL installed, it's simple to
build from source, and it will make it easier for sbcl developers to
provide you with assistance in changing the implementation itself.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | Free Mumia Abu-Jamal! |
     ,--'    _,'   | Abolish the racist    |
    /       /      | death penalty!        |
   (   -.  |       `-----------------------'
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Emre Sevinc
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Fri, 22 Jul 2005 00:15:35 +0000
Message-ID: <87mzofy9ug.fsf@ileriseviye.org>

Emre Sevinc <·····@bilgi.edu.tr> writes:

> Maybe I should give GNU CLISP a try?




BTW, if I install the CLISP package, what is the correct
way of telling Emacs and SLIME "now switch to CLISP
as the Lisp implementation" and "now switch back to
SBCL"?






-- 
Emre Sevinc


From: Marco Baringer
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Fri, 22 Jul 2005 09:21:58 +0000
Message-ID: <m264v3dwll.fsf@soma.local>

Emre Sevinc <·····@bilgi.edu.tr> writes:

> BTW, if I install the CLISP package, what is the correct
> way of telling Emacs and SLIME "now switch to CLISP
> as the Lisp implementation" and "now switch back to
> SBCL"?

1) using slime's built-in multiple implementation support:

(slime-register-lisp-implementation "clisp" "/usr/bin/clisp")
(slime-register-lisp-implementation "sbcl" "/usr/bin/sbcl")

then do "C-u M-x slime RET clisp" to run clisp or "C-u M-x slime RET sbcl"
to run sbcl.

2) using hand written elisp:

(defun clisp ()
  (interactive)
  (slime "/usr/bin/clisp"))

(defun sbcl ()
  (interactive)
  (slime "/usr/bin/sbcl"))

start clisp with M-x clisp, start sbcl with M-x sbcl.

-- 
-Marco
Ring the bells that still can ring.
Forget the perfect offering.
There is a crack in everything.
That's how the light gets in.
	-Leonard Cohen

From: Emre Sevinc
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Sat, 23 Jul 2005 23:30:27 +0000
Message-ID: <87ll3xnlrg.fsf@ileriseviye.org>

"Marco Baringer" <··@bese.it> writes:

> start clisp with M-x clisp, start sbcl with M-x sbcl.


Thanks for the beautiful "tip of the day" :)
This is really useful for having many Lisp implementations
at once.


-- 
Emre Sevinc


From: Ivan Boldyrev
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Fri, 22 Jul 2005 15:08:20 +0000
Message-ID: <6oe7r2-6dr.ln1@ibhome.cgitftp.uiggm.nsc.ru>


On 9178 day of my life Emre Sevinc wrote:
> Emre Sevinc <·····@bilgi.edu.tr> writes:
> Somebody said that Ivan Boldyrev claims to have posted (on sbcl-devel)
> a patch that adds handling for encodings up to ISO-8859-14,
> but I don't know much about it.

1.  Get <http://cgitftp.uiggm.nsc.ru/~ib/sbcl-enc.tar.bz2>  (PGP
    signature: <http://cgitftp.uiggm.nsc.ru/~ib/sbcl-enc.tar.bz2.asc>).

2.  Open enc-iso.lisp in your favourite text editor and replace #:sb!imp
    with #:sb-imp.  Save the file.

3.  (progn (compile-file "enc-iso")
           (load "enc-iso"))

4.  Dump the SBCL core.
    Bingo!  You have an ISO-8859-9-powered core.


There is also a patch to sbcl/src/code/octets.lisp, but ISO-8859-9 is
not affected by it.
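
Steps 3 and 4 can be sketched as a single function, assuming SBCL,
assuming enc-iso.lisp sits in the current directory, and assuming it has
already been edited as in step 2. BUILD-ISO-CORE and the core file name
"sbcl-iso.core" are hypothetical names chosen for illustration.

```lisp
;; Sketch only: BUILD-ISO-CORE and "sbcl-iso.core" are hypothetical names.
;; The #+sbcl guard keeps other Lisps from tripping over the SB-EXT symbol.
#+sbcl
(defun build-iso-core (&optional (core-name "sbcl-iso.core"))
  "Compile and load enc-iso.lisp, then dump a core that includes it."
  ;; COMPILE-FILE returns the compiled file's name, which LOAD accepts.
  (load (compile-file "enc-iso"))
  ;; SAVE-LISP-AND-DIE writes the core file and exits the running image.
  (sb-ext:save-lisp-and-die core-name))
```

The saved core would then be started with something like
"sbcl --core sbcl-iso.core". SAVE-LISP-AND-DIE generally refuses to run
while other threads are alive, so it is safer to call this from a plain
terminal session than from inside SLIME.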

-- 
Ivan Boldyrev

                                                  Your bytes are bitten.


From: Ivan Boldyrev
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Fri, 22 Jul 2005 16:30:45 +0000
Message-ID: <lij7r2-rht.ln1@ibhome.cgitftp.uiggm.nsc.ru>

On 9178 day of my life Emre Sevinc wrote:
> I also tried something like:
>
>  (with-open-file (stream file-name :external-format :latin-1)
>   .
>   .
>   .
>
> which didn't work

Why doesn't it work?   *Every* byte sequence can be read as a
Latin-1 string...

Which version of SBCL do you currently use?  Is it Unicode-enabled?

My patches are for Unicode-enabled SBCL (properly ./configure-ed 0.9.x).

-- 
Ivan Boldyrev

                                        | recursion, n:
                                        |       See recursion

From: Emre Sevinc
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Sat, 23 Jul 2005 23:41:15 +0000
Message-ID: <87hdelnl9g.fsf@ileriseviye.org>

Ivan Boldyrev <···············@cgitftp.uiggm.nsc.ru> writes:

> On 9178 day of my life Emre Sevinc wrote:
>> I also tried something like:
>>
>>  (with-open-file (stream file-name :external-format :latin-1)
>>   .
>>   .
>>   .
>>
>> which didn't work
>
> Why doesn't it work?   *Every* byte sequence can be read as a
> Latin-1 string...
>
> Which version of SBCL do you currently use?  Is it Unicode-enabled?
> My patches are for Unicode-enabled SBCL (properly ./configure-ed 0.9.x).

I was using an old one, 0.8.x, but today I switched
to Linux kernel 2.6.11 on my Debian GNU/Linux system
and was finally able to install SBCL 0.9.2 with Unicode support.
Then I changed my web application to use UTF-8 encoding
(instead of ISO-8859-9).

Now when you try:

  http://fz.dyndns.org:8080/grammar-form

it comes with UTF-8 encoding and, as far as I can test,
it works fine for all of the Turkish characters (at least
the .png output is fine; .ps and .pdf still need to be corrected.
I guess that's a problem with either the PS fonts or graphviz,
I'm not sure).
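
A minimal sketch of the bracket reader handling Turkish text once the
input reaches Lisp as real characters, assuming a Unicode-enabled Lisp
in which non-ASCII letters are ordinary constituent characters. The
names ending in -DEMO are hypothetical, chosen so they do not clash with
the definitions in the first post; the quote-to-asterisk substitution
from the original PARSE-TEXT is left out for brevity.

```lisp
;; Sketch only: the -DEMO names are hypothetical, not from the thread.
(defvar *demo-readtable* (copy-readtable nil))

(defun read-bracketed-list-demo (stream char)
  "Reader macro function: read forms up to the closing bracket."
  (declare (ignore char))
  (read-delimited-list #\] stream t))

;; Make ] terminate tokens the way ) does, and make [ start a list.
(set-syntax-from-char #\] #\) *demo-readtable*)
(set-macro-character #\[ 'read-bracketed-list-demo nil *demo-readtable*)
;; Keep the case of each word exactly as written in the input.
(setf (readtable-case *demo-readtable*) :preserve)

(defun parse-text-demo (text)
  "Read one bracketed expression from the string TEXT."
  (let ((*readtable* *demo-readtable*))
    (with-input-from-string (s text)
      (read s))))
```

Calling (parse-text-demo "[NP [Spec bu] [N ağaç]]") should return a
nested list of symbols whose names keep the Turkish letters intact,
provided the source file or REPL connection itself is decoded as UTF-8.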

BTW, when is your patch going to be a part of the official
SBCL distribution? Even though I solved my problem using the
current SBCL binary package and UTF-8, I'd like to know
when it will also support ISO-8859-9.

Happy hacking.

-- 
Emre Sevinc


From: Ivan Boldyrev
Subject: Re: How to handle non-ASCII characters in SBCL's Lisp reader?
Date: Sun, 24 Jul 2005 05:14:37 +0000
Message-ID: <umkbr2-ecc.ln1@ibhome.cgitftp.uiggm.nsc.ru>

On 9180 day of my life Emre Sevinc wrote:
> BTW, when is your patch going to be a part of the official
> SBCL distribution?

I haven't the slightest idea.  No reaction yet from the SBCL maintainers.
Monday is 0.9.3 release time, so they are probably very busy :)

-- 
Ivan Boldyrev

                        Today is the first day of the rest of your life.