Hello,
I am trying to process a file line-by-line and everything was working
perfectly until I got this message:
decoding error on stream
#<SB-SYS:FD-STREAM for "file
\"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (142) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
Restarts:
0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character
boundary and continue.
1: [FORCE-END-OF-FILE] Force an end of file.
2: [ABORT-REQUEST] Abort handling SLIME request.
3: [TERMINATE-THREAD] Terminate this thread (3063307168)
I'm running SBCL 0.9.2 on an Ubuntu 5.10 box. The offending characters
seem to show up like this in vi: <Be>

I've tried to find an answer on Google and in the HyperSpec but I didn't
find anything that seemed to help. I did see that SBCL had a Unicode bug
back around 0.8.something, but it had been fixed.
For what it's worth, I'm trying to stuff the Moby part-of-speech lexicon
into a PostgreSQL database table.
Thanks for any help on where I should go from here.
--dennis
Dennis Dunn wrote:
> I am trying to process a file line-by-line and everything was working
> perfectly until I get this message :
>
> decoding error on stream
> #<SB-SYS:FD-STREAM for "file
> \"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
> (:EXTERNAL-FORMAT :UTF-8):
> the octet sequence (142) cannot be decoded.
> [Condition of type SB-INT:STREAM-DECODING-ERROR]
>
> Restarts:
> 0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character
> boundary and continue.
> 1: [FORCE-END-OF-FILE] Force an end of file.
> 2: [ABORT-REQUEST] Abort handling SLIME request.
> 3: [TERMINATE-THREAD] Terminate this thread (3063307168)
>
> I'm running SBCL 0.9.2 on an Ubuntu 5.10 box. The offending characters
> seem to show up like this in vi: <Be>
I once had a problem which looked vaguely similar, and I solved it with
these lines in my .emacs:
;;http://www.cliki.net/SLIME%20Tips#unicode
(set-language-environment "UTF-8")
(setq slime-net-coding-system 'utf-8-unix)
But as I'm not exactly an Emacs guru, I have no idea whether this is
useful for you.
Tayssir
Hello,
> I once had a problem which looked vaguely similar, and I solved it
> with these lines in my .emacs:
>
> ;;http://www.cliki.net/SLIME%20Tips#unicode
> (set-language-environment "UTF-8")
> (setq slime-net-coding-system 'utf-8-unix)
I tried this but it didn't help. I ended up downloading SBCL 0.9.9 and
using :external-format.
Thanks,
--dennis
Dennis Dunn <·····@insight.rr.com> wrote:
> Hello,
>
> I am trying to process a file line-by-line and everything was working
> perfectly until I get this message :
>
> decoding error on stream
> #<SB-SYS:FD-STREAM for "file
> \"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
> (:EXTERNAL-FORMAT :UTF-8):
> the octet sequence (142) cannot be decoded.
> [Condition of type SB-INT:STREAM-DECODING-ERROR]
You're trying to read a file as UTF-8 (probably because of your system
locale settings), but it doesn't contain valid UTF-8-encoded data.
Either change your locale to a non-Unicode one, fix the file to
contain valid UTF-8, or use the :EXTERNAL-FORMAT keyword parameter to
OPEN / WITH-OPEN-FILE to specify the encoding, e.g.

(with-open-file (stream "/foo/bar" :external-format :iso-8859-1)
  ...)
--
Juho Snellman
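If changing the external format isn't an option, the restarts shown in the
original backtrace can also be invoked programmatically. A minimal sketch
(note that SB-INT:STREAM-DECODING-ERROR and the ATTEMPT-RESYNC restart are
SBCL internals, not part of the standard, and may change between releases;
the function name here is made up for illustration):

(defun read-lines-skipping-bad-octets (path)
  ;; Skip undecodable byte sequences instead of landing in the debugger.
  (with-open-file (in path :external-format :utf-8)
    (handler-bind ((sb-int:stream-decoding-error
                    (lambda (c)
                      (declare (ignore c))
                      ;; Same effect as choosing restart 0 by hand.
                      (invoke-restart 'sb-int:attempt-resync))))
      (loop for line = (read-line in nil nil)
            while line
            collect line))))

Characters at the bad offsets are lost, so this is only reasonable when the
occasional dropped byte doesn't matter.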
Hello,
<snip>
> contain valid UTF-8, or use the :EXTERNAL-FORMAT keyword parameter to
> OPEN / WITH-OPEN-FILE to specify the encoding. e.g.
>
> (with-open-file (stream "/foo/bar" :external-format :iso-8859-1)
> ...)
Thanks for your reply. I had read the bit about :external-format in the
HyperSpec; it's a keyword argument passed to OPEN. I tried :latin-1 on
SBCL 0.9.2, but that didn't help. I then downloaded SBCL 0.9.9 and tried
your suggestion of :external-format :iso-8859-1 and was able to read my
file. I also tried :latin-1, and it worked as well.
I'm wondering if there is a utility that can determine what encoding a
text file uses. The file I was processing was downloaded from the net and
didn't have any indication as to the encoding.
Thanks again.
--dennis
+ Dennis Dunn <·····@insight.rr.com>:
| I'm wondering if there is a utility that can determine what encoding
| a text file uses.
This is impossible in principle, though I suppose you could use
various kinds of heuristics to make reasonable guesses. If the file
uses Unicode /and/ begins with U+FEFF ZERO WIDTH NO-BREAK SPACE
(previously known as BYTE ORDER MARK), then at least you can tell the
difference between UTF-8 and various other ways to encode Unicode (see
the Unicode docs if you must). But to tell one 8-bit encoding from
another, such as the various iso-8859-x encodings, you must understand
the contents of the file. That might be a nice little AI project, I
imagine.
--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
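The byte-order-mark check described above is easy to do by reading the first
few raw octets. A sketch (the function name is made up; it only recognizes
the UTF-8 and UTF-16 BOMs and returns NIL for everything else, so it cannot
distinguish the 8-bit encodings):

(defun guess-encoding-from-bom (path)
  ;; Read the leading bytes of the file and compare against known BOMs.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((b0 (read-byte in nil))
          (b1 (read-byte in nil))
          (b2 (read-byte in nil)))
      (cond ((and (eql b0 #xEF) (eql b1 #xBB) (eql b2 #xBF)) :utf-8)
            ((and (eql b0 #xFF) (eql b1 #xFE)) :utf-16le)
            ((and (eql b0 #xFE) (eql b1 #xFF)) :utf-16be)
            (t nil)))))

Most plain-text files on Unix systems carry no BOM at all, which is why
heuristic tools (like the enca utility mentioned below in this thread's era)
fall back on statistical analysis of the file contents.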
On 9384 day of my life Dennis Dunn wrote:
> I'm wondering if there is a utility that can determine what encoding a
> text file uses. The file I was processing was downloaded from the net and
> didn't have any indication as to the encoding.
http://trific.ath.cx/software/enca/
"ENCA detects the character coding of a file and converts it if
desired".
--
Ivan Boldyrev
Assembly of a Japanese bicycle requires greatest peace of spirit.
Hi,
Thanks for this ENCA link, I'll take a look at it. I'm pretty much
finished with the file I was working on but it's good to know what's
available.
--dennis