I have a problem reading a Unicode stream in SBCL.
My simple example is copied from a mailing list on common-lisp.net:
(defun crlf (s)
  (write-char #\Return s)
  (write-char #\Newline s))

(defun http-get (site page)
  (with-open-stream (s (trivial-sockets:open-stream site 80))
    (format s "GET ~A HTTP/1.0" page) (crlf s)
    (format s "Host: ~A" site) (crlf s)
    (crlf s)
    (force-output s)
    (loop for line = (read-line s nil nil)
          while line
          do (format t "~A~%" line))))

(http-get "www.google.co.jp" "/")
==>
debugger invoked on a SB-INT:STREAM-DECODING-ERROR:
decoding error on stream
#<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
(:EXTERNAL-FORMAT
:UTF-8):
the octet sequence (131) cannot be decoded.
Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ATTEMPT-RESYNC ] Attempt to resync the stream at a character
boundary and continue.
1: [FORCE-END-OF-FILE] Force an end of file.
2: [ABORT ] Exit debugger, returning to top level.
(SB-INT:STREAM-DECODING-ERROR
#<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
(131))
0]
I've set the locale environment as:
LANG=en_US.UTF-8
LC_CTYPE=ja_JP.UTF-8
And :sb-unicode is listed in *features*.
Can anybody tell me how to fix this?
"Pebblestone" <··········@gmail.com> writes:
> (http-get "www.google.co.jp" "/")
www.google.co.jp doesn't send utf-8, but sends Shift-JIS, at least
with the HTTP request you sent.
> ==>
>
> debugger invoked on a SB-INT:STREAM-DECODING-ERROR:
> decoding error on stream
> #<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
> (:EXTERNAL-FORMAT
> :UTF-8):
> the octet sequence (131) cannot be decoded.
Not all Shift-JIS sequences are valid UTF-8 sequences (and even if
they were, they would mean something different).
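To make the failing octet concrete, here is a minimal sketch (the function names are made up for illustration, and the Shift-JIS range includes the vendor extension area). Octet 131 is #x83, which is a lead octet in Shift-JIS but can only be a continuation octet in UTF-8, so a UTF-8 decoder cannot start a character there.

```lisp
(defun utf-8-lead-octet-p (octet)
  "True if OCTET can begin a well-formed UTF-8 sequence."
  (or (<= #x00 octet #x7F)      ; single-octet (ASCII) character
      (<= #xC2 octet #xF4)))    ; lead octet of a 2-, 3- or 4-octet sequence

(defun shift-jis-lead-octet-p (octet)
  "True if OCTET is the first octet of a two-octet Shift-JIS character."
  (or (<= #x81 octet #x9F)
      (<= #xE0 octet #xFC)))

(utf-8-lead-octet-p 131)        ; => NIL
(shift-jis-lead-octet-p 131)    ; => T
```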
> I've set the local env as:
> LANG=en_US.UTF-8
> LC_CTYPE=ja_JP.UTF-8
>
> And :sb-unicode are listed in *features*
>
> Can anybody tell me how to fix this?
The HTTP protocol, despite initial appearances, is a binary protocol.
If you want to be maximally correct, you should use a binary (that is,
(unsigned-byte 8) in this case) stream, and do any conversions to and
from text yourself on subsections of this stream, after parsing the
HTTP headers and the <meta> section in the head of the HTML.
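As a rough sketch of that approach, assuming the whole response has already been read into a vector of octets: find the blank line that ends the headers, turn only the header part into a string, and leave the body as octets until its encoding is known. The function names are invented for this example, and mapping octets straight to characters is only reasonable for the header section.

```lisp
(defun header-end (octets)
  "Index just past the CR LF CR LF that ends the headers, or NIL if absent."
  (let ((pos (search #(13 10 13 10) octets)))
    (and pos (+ pos 4))))

(defun octets-to-header-string (octets end)
  "Map each of the first END octets to the character with the same code."
  (map 'string #'code-char (subseq octets 0 end)))

(defun split-response (octets)
  "Return two values: the header text, and the still-undecoded body octets."
  (let ((end (header-end octets)))
    (if end
        (values (octets-to-header-string octets end) (subseq octets end))
        (values nil octets))))
```

The body octets can then be decoded separately once the charset has been determined from the Content-Type header or the <meta> element.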
You could also use a bivalent stream, which is obtained in SBCL by
specifying :element-type :default to calls opening streams. I'm not
sure whether (setf stream-external-format) has been implemented yet,
which would be a very useful operator in your case.
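A small sketch of the bivalent idea, using a file instead of a socket so that it needs no network. This is SBCL-specific and rests on my reading of the SBCL manual, which says that :element-type :default yields a stream accepting both character and (unsigned-byte 8) operations. The function name is made up, and :latin-1 is chosen only so that the text part never fails to decode.

```lisp
(defun read-line-then-octets (path)
  "Read one line as text with READ-LINE, then the remaining data with READ-BYTE.
Returns the line and a list of the remaining octets."
  (with-open-file (s path :element-type :default      ; SBCL: bivalent stream
                          :external-format :latin-1)
    (let* ((line (read-line s))
           (octets (loop for octet = (read-byte s nil nil)
                         while octet
                         collect octet)))
      (values line octets))))
```

With a socket, the same pattern would read the header lines as text and the body as raw octets.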
Christophe
Could you tell me how to modify the program?
I changed
(trivial-sockets:open-stream site 80)
==>
(trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))
which signaled an error during the execution.
"Pebblestone" <··········@gmail.com> writes:
> Could you tell me how to modify the program?
>
> I changed
> (trivial-sockets:open-stream site 80)
> ==>
> (trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))
>
> which signaled an error during the execution.
'(unsigned-byte 8) is not a valid argument for the :external-format
keyword. It is valid as :element-type, though: perhaps you meant
that?
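The same distinction holds for the standard OPEN: :element-type takes a type specifier such as '(unsigned-byte 8), while :external-format takes an encoding designator such as :utf-8. Here is a sketch with a plain file (the function name is invented for the example); for the socket, the call would presumably be (trivial-sockets:open-stream site 80 :element-type '(unsigned-byte 8)).

```lisp
(defun write-then-read-octets (path octets)
  "Write OCTETS to PATH through a binary stream, then read them back."
  (with-open-file (out path :direction :output :if-exists :supersede
                            :element-type '(unsigned-byte 8))
    (write-sequence octets out))
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence buffer in)
      buffer)))
```

No decoding happens on such a stream, so an octet like 131 comes back unchanged.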
Christophe
"Pebblestone" <··········@gmail.com> writes:
> Could you tell me how to modify the program?
>
> I changed
> (trivial-sockets:open-stream site 80)
> ==>
> (trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))
>
> which signaled an error during the execution.
Obviously.
(defvar *ascii* " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")
(defun string-encode-to-ascii (string)
  ;; Returns a vector of octets; signals an error on non-ASCII characters.
  (coerce
   (loop
     :for ch :across string
     :for code = (case ch
                   ((#\Linefeed) 10)
                   ((#\Return) 13)
                   (otherwise
                    ;; *ascii* starts at the space character, code 32.
                    (let ((pos (position ch *ascii* :test (function char=))))
                      (and pos (+ 32 pos)))))
     :when (null code)
       :do (error "~S contains a non-ASCII-encodable character: ~C" string ch)
     :when (not (null code)) :collect code)
   '(vector (unsigned-byte 8))))

(defun crlf (s)
  ;; A newline can be mapped to cr, lf, cr+lf, or anything else by lisp,
  ;; so write the two octets explicitly.
  (write-sequence (string-encode-to-ascii #(#\Return #\Linefeed)) s))

(defun http-get (site page)
  ;; Open the socket as a binary stream so that octets can be written and read.
  (with-open-stream (s (trivial-sockets:open-stream
                        site 80 :element-type '(unsigned-byte 8)))
    (write-sequence
     (string-encode-to-ascii (format nil "GET ~A HTTP/1.0" page)) s) (crlf s)
    (write-sequence
     (string-encode-to-ascii (format nil "Host: ~A" site)) s) (crlf s)
    (crlf s)
    (force-output s)
    (loop
      :with buffer = (make-array 2048 :element-type '(unsigned-byte 8))
      :for read-count = (read-sequence buffer s)
      :while (plusp read-count) ; but this is not really what you want to test!
      :do (format t "~A~%" (string-decode-from-whatever-encoding
                            buffer read-count)))))

string-decode-from-whatever-encoding could be string-decode-from-shift-jis
if that's what you get.
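For SBCL, one possible sketch of such a decoder uses sb-ext:octets-to-string. It assumes an SBCL recent enough to provide the :shift_jis external format, and the function name simply follows the suggestion above. It is not robust as it stands: a two-octet character split across two 2048-octet buffers would not decode correctly, so real code should collect the whole body first.

```lisp
(defun string-decode-from-shift-jis (buffer count)
  "Decode the first COUNT octets of BUFFER as Shift-JIS text (SBCL only).
BUFFER must be a vector with element type (unsigned-byte 8)."
  (sb-ext:octets-to-string buffer :external-format :shift_jis :end count))
```

For example, the two octets #x82 #xA0 should decode to the single hiragana character U+3042.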
If I were you, I'd rather use clisp which has the best support for
encodings...
--
__Pascal Bourguignon__ http://www.informatimago.com/