Hello,
I am trying to process a file line-by-line and everything was working
perfectly until I got this message:
decoding error on stream
#<SB-SYS:FD-STREAM for "file
\"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (142) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
Restarts:
0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character
boundary and continue.
1: [FORCE-END-OF-FILE] Force an end of file.
2: [ABORT-REQUEST] Abort handling SLIME request.
3: [TERMINATE-THREAD] Terminate this thread (3063307168)
I'm running SBCL 0.9.2 on an Ubuntu 5.10 box. The offending characters
seem to show up like this in vi: <Be>

I've tried to find an answer on Google and in the HyperSpec but I didn't
find anything that seemed to help. I did see that SBCL had a Unicode bug
back around 0.8.something, but it had been fixed.
For what it's worth, I'm trying to stuff the Moby part-of-speech lexicon
into a PostgreSQL database table.
Thanks for any help on where I should go from here.
--dennis
Dennis Dunn wrote:
> I am trying to process a file line-by-line and everything was working
> perfectly until I get this message :
>
> decoding error on stream
> #<SB-SYS:FD-STREAM for "file
> \"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
> (:EXTERNAL-FORMAT :UTF-8):
> the octet sequence (142) cannot be decoded.
> [Condition of type SB-INT:STREAM-DECODING-ERROR]
>
> Restarts:
> 0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character
> boundary and continue.
> 1: [FORCE-END-OF-FILE] Force an end of file.
> 2: [ABORT-REQUEST] Abort handling SLIME request.
> 3: [TERMINATE-THREAD] Terminate this thread (3063307168)
>
> I'm running SBCL 0.9.2 on an Ubuntu 5.10 box. The offending characters
> seem to show up like this in vi: <Be>
I once had a problem which looked vaguely similar, and I solved it with
these lines in my .emacs:
;;http://www.cliki.net/SLIME%20Tips#unicode
(set-language-environment "UTF-8")
(setq slime-net-coding-system 'utf-8-unix)
But as I'm not exactly an Emacs guru, I have no idea whether this is
useful for you.
Tayssir
Hello,
> I once had a problem which looked vaguely similar, and I solved it
> with these lines in my .emacs:
>
> ;;http://www.cliki.net/SLIME%20Tips#unicode
> (set-language-environment "UTF-8")
> (setq slime-net-coding-system 'utf-8-unix)
I tried this but it didn't help. I ended up downloading SBCL 0.9.9 and
using :external-format.
Thanks,
--dennis
Dennis Dunn <·····@insight.rr.com> wrote:
> Hello,
>
> I am trying to process a file line-by-line and everything was working
> perfectly until I get this message :
>
> decoding error on stream
> #<SB-SYS:FD-STREAM for "file
> \"/home/ansofive/projects/camc/lexicon/a.txt\"" {A1C48E9}>
> (:EXTERNAL-FORMAT :UTF-8):
> the octet sequence (142) cannot be decoded.
> [Condition of type SB-INT:STREAM-DECODING-ERROR]
You're trying to read a file as UTF-8 (probably because of your system
locale settings), but it doesn't contain valid UTF-8-encoded data.
Either change your locale to a non-Unicode one, fix the file to
contain valid UTF-8, or use the :EXTERNAL-FORMAT keyword parameter to
OPEN / WITH-OPEN-FILE to specify the encoding, e.g.

(with-open-file (stream "/foo/bar" :external-format :iso-8859-1)
  ...)
--
Juho Snellman
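If changing the external format isn't an option, the restarts shown in the
original backtrace can also be invoked programmatically. A minimal sketch
(note that SB-INT:STREAM-DECODING-ERROR and the ATTEMPT-RESYNC restart are
SBCL internals, not part of the standard, and may change between releases;
the function name here is made up for illustration):

(defun read-lines-skipping-bad-octets (path)
  ;; Skip undecodable byte sequences instead of landing in the debugger.
  (with-open-file (in path :external-format :utf-8)
    (handler-bind ((sb-int:stream-decoding-error
                    (lambda (c)
                      (declare (ignore c))
                      ;; Same effect as choosing restart 0 by hand.
                      (invoke-restart 'sb-int:attempt-resync))))
      (loop for line = (read-line in nil nil)
            while line
            collect line))))

Characters at the bad offsets are lost, so this is only reasonable when the
occasional dropped byte doesn't matter.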
Hello,
<snip>
> contain valid UTF-8, or use the :EXTERNAL-FORMAT keyword parameter to
> OPEN / WITH-OPEN-FILE to specify the encoding. e.g.
>
> (with-open-file (stream "/foo/bar" :external-format :iso-8859-1)
> ...)
Thanks for your reply. I had read the bit about :external-format in the
HyperSpec; it's a keyword argument passed to OPEN. I tried :latin-1 on
SBCL 0.9.2, but that didn't help. I then downloaded SBCL 0.9.9 and tried
your suggestion of :external-format :iso-8859-1 and was able to read my
file. I also tried :latin-1, and it worked as well.
I'm wondering if there is a utility that can determine what encoding a
text file uses. The file I was processing was downloaded from the net and
didn't have any indication as to the encoding.
Thanks again.
--dennis
+ Dennis Dunn <·····@insight.rr.com>:
| I'm wondering if there is a utility that can determine what encoding
| a text file uses.
This is impossible in principle, though I suppose you could use
various kinds of heuristics to make reasonable guesses. If the file
uses Unicode /and/ begins with U+FEFF ZERO WIDTH NO-BREAK SPACE
(previously known as BYTE ORDER MARK), then at least you can tell the
difference between UTF-8 and various other ways to encode Unicode (see
the Unicode docs if you must). But to tell one 8-bit encoding from
another, such as the various iso-8859-x encodings, you must understand
the contents of the file. That might be a nice little AI project, I
imagine.
--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
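The byte-order-mark check described above is easy to do by reading the first
few raw octets. A sketch (the function name is made up; it only recognizes
the UTF-8 and UTF-16 BOMs and returns NIL for everything else, so it cannot
distinguish the 8-bit encodings):

(defun guess-encoding-from-bom (path)
  ;; Read the leading bytes of the file and compare against known BOMs.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((b0 (read-byte in nil))
          (b1 (read-byte in nil))
          (b2 (read-byte in nil)))
      (cond ((and (eql b0 #xEF) (eql b1 #xBB) (eql b2 #xBF)) :utf-8)
            ((and (eql b0 #xFF) (eql b1 #xFE)) :utf-16le)
            ((and (eql b0 #xFE) (eql b1 #xFF)) :utf-16be)
            (t nil)))))

Most plain-text files on Unix systems carry no BOM at all, which is why
heuristic tools (like the enca utility mentioned below in this thread's era)
fall back on statistical analysis of the file contents.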
On 9384 day of my life Dennis Dunn wrote:
> I'm wondering if there is a utility that can determine what encoding a
> text file uses. The file I was processing was downloaded from the net and
> didn't have any indication as to the encoding.
http://trific.ath.cx/software/enca/
"ENCA detects the character coding of a file and converts it if
desired".
--
Ivan Boldyrev
Assembly of a Japanese bicycle requires greatest peace of spirit.
Hi,
Thanks for this ENCA link, I'll take a look at it. I'm pretty much
finished with the file I was working on but it's good to know what's
available.
--dennis