From: Pebblestone
Subject: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <1152189944.396991.267670@a14g2000cwb.googlegroups.com>
I have a problem reading a Unicode stream in SBCL.
My simple example is copied from a mailing list on common-lisp.net:

(defun crlf (s)
  (write-char #\Return s)
  (write-char #\Newline s))

(defun http-get (site page)
  (with-open-stream (s (trivial-sockets:open-stream site 80))
    (format s "GET ~A HTTP/1.0" page) (crlf s)
    (format s "Host: ~A" site) (crlf s)
    (crlf s)
    (force-output s)
    (loop for line = (read-line s nil nil)
	 while line
	 do (format t "~A~%" line))))

(http-get "www.google.co.jp" "/")

==>

debugger invoked on a SB-INT:STREAM-DECODING-ERROR:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
(:EXTERNAL-FORMAT
  :UTF-8):
    the octet sequence (131) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC   ] Attempt to resync the stream at a character
                         boundary and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [ABORT            ] Exit debugger, returning to top level.

(SB-INT:STREAM-DECODING-ERROR
 #<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
 (131))
0]


I've set the locale environment as:
LANG=en_US.UTF-8
LC_CTYPE=ja_JP.UTF-8

And :sb-unicode is listed in *features*

Can anybody tell me how to fix this?

From: Christophe Rhodes
Subject: Re: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <sqodw2svxa.fsf@cam.ac.uk>
"Pebblestone" <··········@gmail.com> writes:

> (http-get "www.google.co.jp" "/")

www.google.co.jp doesn't send utf-8, but sends Shift-JIS, at least
with the HTTP request you sent.

> ==>
>
> debugger invoked on a SB-INT:STREAM-DECODING-ERROR:
>   decoding error on stream
>   #<SB-SYS:FD-STREAM for "a constant string" {4A49EDC9}>
> (:EXTERNAL-FORMAT
>   :UTF-8):
>     the octet sequence (131) cannot be decoded.

Not all Shift-JIS sequences are valid UTF-8 sequences (and even if
they were, they would mean something different).
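
To make the failure concrete: octet 131 is #x83, which has the bit
pattern 10xxxxxx. In UTF-8 that pattern is only legal as a continuation
byte inside a multi-byte sequence, while in Shift-JIS #x83 is the lead
byte of a two-byte character. A minimal sketch of the byte classes
follows; the function names are made up for illustration, and the
Shift-JIS ranges cover only the JIS X 0208 lead bytes.

```lisp
(defun utf-8-lead-octet-p (octet)
  "True if OCTET may begin a UTF-8 sequence."
  (or (< octet #x80)            ; 0xxxxxxx: single-byte ASCII
      (<= #xC2 octet #xF4)))    ; lead bytes of 2-, 3- and 4-byte sequences

(defun utf-8-continuation-octet-p (octet)
  "True if OCTET has the form 10xxxxxx, valid only inside a sequence."
  (= (logand octet #xC0) #x80))

(defun shift-jis-lead-octet-p (octet)
  "True if OCTET is a JIS X 0208 lead byte in Shift-JIS."
  (or (<= #x81 octet #x9F)
      (<= #xE0 octet #xEF)))
```

With these definitions, 131 is a continuation octet for UTF-8 and a lead
octet for Shift-JIS, which is why a UTF-8 decoder rejects it when it
appears at the start of a character.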

> I've set the local env as:
> LANG=en_US.UTF-8
> LC_CTYPE=ja_JP.UTF-8
>
> And :sb-unicode is listed in *features*
>
> Can anybody tell me how to fix this?

The HTTP protocol, despite initial appearances, is a binary protocol.
If you want to be maximally correct, you should use a binary (that is,
(unsigned-byte 8) in this case) stream, and do any conversions to and
from text yourself on subsections of this stream, after parsing the
HTTP headers and the <meta> section in the head of the HTML.
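
As a rough sketch of that approach, assuming the whole response has
already been read into a vector of octets: find the blank line that ends
the headers, decode only the header part as ASCII, and look for a
charset parameter there. The function names are hypothetical, the
parsing is deliberately naive, and CODE-CHAR is assumed to map codes
below 128 to ASCII characters (true in SBCL, not guaranteed by the
standard).

```lisp
(defun header-end (octets)
  "Index just past the first CR LF CR LF in OCTETS, or NIL if absent."
  (let ((pos (search #(13 10 13 10) octets)))
    (and pos (+ pos 4))))

(defun octets-to-ascii-string (octets &key (start 0) (end (length octets)))
  "Decode a region of OCTETS as ASCII; other octets become #\?."
  (map 'string
       (lambda (o) (if (< o 128) (code-char o) #\?))
       (subseq octets start end)))

(defun header-charset (headers)
  "Return the value after \"charset=\" in the string HEADERS, or NIL."
  (let ((pos (search "charset=" headers :test #'char-equal)))
    (when pos
      (let* ((start (+ pos (length "charset=")))
             (end (or (position-if
                       (lambda (c)
                         (member c '(#\Return #\Linefeed #\; #\Space)))
                       headers :start start)
                      (length headers))))
        (subseq headers start end)))))
```

The body, from (header-end octets) onward, stays as raw octets until the
encoding is known; only then is it converted to a string.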

You could also use a bivalent stream, which is obtained in SBCL by
specifying :element-type :default to calls opening streams.  I'm not
sure whether (setf stream-external-format) has been implemented yet,
which would be a very useful operator in your case.
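
For illustration, here is a small SBCL-specific sketch of a bivalent
stream, using a file instead of a socket so that it is self-contained.
The function name and the file contents are made up, and LATIN-1 is
chosen only so that the text part cannot fail to decode.

```lisp
;; SBCL-specific: :element-type :default requests a bivalent stream.
(defun bivalent-demo (path)
  "Write a text line followed by two raw octets, then read them back
through one stream using both READ-LINE and READ-BYTE."
  (with-open-file (out path :direction :output :if-exists :supersede
                            :element-type '(unsigned-byte 8))
    ;; "Hi", a linefeed, then the octets 131 and 65.
    (write-sequence (make-array 5 :element-type '(unsigned-byte 8)
                                  :initial-contents '(72 105 10 131 65))
                    out))
  (with-open-file (in path :element-type :default
                           :external-format :latin-1)
    (let* ((line (read-line in))     ; character input
           (b1 (read-byte in))       ; binary input on the same stream
           (b2 (read-byte in)))
      (list line b1 b2))))
```

The same idea would apply to a socket stream: read the header lines as
text, then switch to READ-BYTE or READ-SEQUENCE for the body.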

Christophe
From: Pebblestone
Subject: Re: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <1152194892.042655.67810@a14g2000cwb.googlegroups.com>
Could you tell me how to modify the program?

I changed
(trivial-sockets:open-stream site 80)
==>
(trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))

which signaled an error during the execution.



From: Christophe Rhodes
Subject: Re: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <sqlkr63htr.fsf@cam.ac.uk>
"Pebblestone" <··········@gmail.com> writes:

> Could you tell me how to modify the program?
>
> I changed
> (trivial-sockets:open-stream site 80)
> ==>
> (trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))
>
> which signaled an error during the execution.

'(unsigned-byte 8) is not a valid argument for the :external-format
keyword.  It is valid as :element-type, though: perhaps you meant
that?
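
The two keywords answer different questions: :element-type says what
unit the stream delivers (characters or integers), while
:external-format names the character encoding used when the elements
are characters. A minimal sketch using the standard OPEN on a file,
with a hypothetical helper name:

```lisp
(defun read-octets (path)
  "Return the contents of the file at PATH as a vector of octets."
  ;; A binary stream: READ-BYTE and READ-SEQUENCE yield integers 0..255,
  ;; and no character decoding takes place.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence buffer in)
      buffer)))
```

Here no :external-format is involved, so no decoding error can occur;
the octets come back exactly as stored.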

Christophe
From: Pascal Bourguignon
Subject: Re: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <87u05uq0l8.fsf@thalassa.informatimago.com>
"Pebblestone" <··········@gmail.com> writes:

> Could you tell me how to modify the program?
>
> I changed
> (trivial-sockets:open-stream site 80)
> ==>
> (trivial-sockets:open-stream site 80 :external-format '(unsigned-byte 8))
>
> which signaled an error during the execution.

Obviously.

(defvar *ascii* " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")

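(defun string-encode-to-ascii (string)
  ;; Return a vector of ASCII codes for STRING; signal an error on any
  ;; character that has no ASCII encoding.
  (coerce
   (loop
      :for ch :across string
      :for code = (case ch
                     ((#\linefeed) 10)
                     ((#\return)   13)
                     ;; *ascii* starts at the space character, code 32.
                     (otherwise
                      (let ((pos (position ch *ascii* :test (function char=))))
                        (and pos (+ 32 pos)))))
      :when (null code)
      :do (error "~S contains a non-ASCII-encodable character: ~C" string ch)
      :when (not (null code))  :collect code) 'vector))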

                  

(defun crlf (s)
  (write-sequence (string-encode-to-ascii #(#\Return #\LineFeed)) s))

;; A newline can be mapped to cr, lf, cr+lf, or anything else by lisp.


(defun http-get (site page)
  ;; Open the socket as a binary stream, since octets are written and read.
  (with-open-stream (s (trivial-sockets:open-stream
                        site 80 :element-type '(unsigned-byte 8)))
    (write-sequence
       (string-encode-to-ascii (format nil "GET ~A HTTP/1.0" page)) s) (crlf s)
    (write-sequence
       (string-encode-to-ascii (format nil "Host: ~A" site)) s) (crlf s)
    (crlf s)
    (force-output s)
    (loop
      :with buffer = (make-array 2048 :element-type '(unsigned-byte 8))
      :for read-count = (read-sequence buffer s)
      :while (plusp read-count) ; but this is not really what you want to test!
      :do (format t "~A~%" (string-decode-from-whatever-encoding
                             buffer read-count)))))

string-decode-from-whatever-encoding could be string-decode-from-shift-jis
if that's what you get.
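
For SBCL specifically, one possible way to fill in that placeholder is
SB-EXT:OCTETS-TO-STRING, assuming a release whose external formats
include :SHIFT_JIS (older releases may not have it). This is only a
sketch: it decodes each buffer on its own, so a two-byte character
split across two reads would not decode correctly.

```lisp
;; SBCL-specific sketch; :shift_jis is assumed to be available.
(defun string-decode-from-shift-jis (buffer read-count)
  "Decode the first READ-COUNT octets of BUFFER as Shift-JIS text."
  (sb-ext:octets-to-string buffer
                           :external-format :shift_jis
                           :end read-count))
```

A more careful version would collect the whole body first, or carry an
incomplete trailing octet over to the next buffer before decoding.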


If I were you, I'd rather use clisp which has the best support for
encodings...

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

ATTENTION: Despite any other listing of product contents found
herein, the consumer is advised that, in actuality, this product
consists of 99.9999999999% empty space.
From: Pebblestone
Subject: Re: Unicode Problem: SBCL(running on FreeBSD)
Date: 
Message-ID: <1152197412.617679.289970@s26g2000cwa.googlegroups.com>
Thank you :)

