From: ·················@gmail.com
Subject: Trouble decoding URL encoded Japanese characters
Date: 
Message-ID: <1159159114.600264.284490@d34g2000cwd.googlegroups.com>
I have what is hopefully an easy question to answer.  I'm trying to
convert some UTF-8 URL-encoded strings to UTF-8 strings.  The URL
strings are in Japanese and look like this:

"%E6%97%A5%E6%9C%AC%E3%81%AE%E9%A3%9F%E3%81%B9%E7%89%A9%E3%81%AF%E3%81%A8%E3%81%A6%E3%82%82%E3%81%8D%E3%82%8C%E3%81%84%E3%81%A7%E3%81%8A%E3%81%84%E3%81%97%E3%81%A7%E3%81%99%E3%81%AD%EF%BC%9F"

After the conversion it should look like this:

日本の食べ物はとてもきれいでおいしですね?

I'm using SLIME with SBCL, or CLISP with UTF-8 support turned on.  I've
tried the URL-decode functions in arnesi, tbnl and araneida.  All of them
appear to return garbage strings and cause SLIME to die if the strings
get printed.  I know Unicode is working in the REPL because I can do
this:

CL-USER> "日本の食"
"日本の食"
CL-USER> (format t "~x" (char-code (aref "日本の食" 0)))
65E5
NIL

Note that #x65E5 does not match any of the hex numbers in the
URL-encoded string above.  This should be a clue, but it is not obvious
to me.  Any pointers to help me out would be appreciated.

Here's my sample code:

(defun unicode-test ()
  (with-open-file (out (merge-pathnames #p"unicode.txt"
                                        (util:source-file-directory))
                       :direction :output
                       :if-exists :supersede
                       :if-does-not-exist :create
                       :external-format :UTF-8)
    (print (araneida:urlstring-unescape
            "%e6%97%a5%e6%9c%ac%e3%81%ae%e9%a3%9f%e3%81%b9%e7%89%a9%e3%81%af%e3%81%a8%e3%81%a6%e3%82%82%e3%81%8d%e3%82%8c%e3%81%84%e3%81%a7%e3%81%8a%e3%81%84%e3%81%97%e3%81%a7%e3%81%99%e3%81%ad%ef%bc%9f")
           out)
    nil))

Thanks in advance,

AnthonyF

From: sunwukong
Subject: Re: Trouble decoding URL encoded Japanese characters
Date: 
Message-ID: <1159164178.299421.215420@b28g2000cwb.googlegroups.com>
> "%E6%97%A5%E6%9C...
[...]
> CL-USER> (format t "~x" (char-code (aref "日本の食" 0)))
> 65E5
> NIL

E697A5 is the real UTF-8 representation of the character 日 (as can be
seen if you press C-u C-x = in Emacs while point is on that character).

Look at the following function (found at
http://cl-debian.alioth.debian.org/repository/pvaneynd/flexi-streams/output.lisp):

(defun translate-char-utf-8 (char-code)
  "Converts the character denoted by the character code CHAR-CODE
into a list of up to six octets which correspond to its UTF-8
encoding."
  (let* (result
         (count
          (cond ((< char-code #x80) (push char-code result) nil)
                ((< char-code #x800) 1)
                ((< char-code #x10000) 2)
                ((< char-code #x200000) 3)
                ((< char-code #x4000000) 4)
                (t 5))))
    (when count
      (loop for rest = char-code then (ash rest -6)
            repeat count
            do (push (logior #b10000000
                             (logand #b111111 rest))
                     result)
            finally (push (logior (logand #b11111111
                                          (ash #b1111110 (- 6 count)))
                                  rest)
                          result)))
    result))

Now

CL-USER> (translate-char-utf-8 (char-code #\日))
(230 151 165)

... that is, E6 97 A5.
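Going the other way, here is a minimal sketch of decoding a three-byte
UTF-8 sequence back into a code point (my own illustration, not taken
from flexi-streams; it assumes a valid three-byte sequence and does no
error checking):

```lisp
;; Minimal sketch: decode one 3-byte UTF-8 sequence into a code point.
;; Assumes the input really is a valid 3-byte sequence; no error checks.
(defun decode-utf-8-3 (b1 b2 b3)
  (logior (ash (logand b1 #b00001111) 12)  ; low 4 bits of the lead byte
          (ash (logand b2 #b00111111) 6)   ; 6 payload bits of each
          (logand b3 #b00111111)))         ; continuation byte

(format nil "~X" (decode-utf-8-3 230 151 165)) ; => "65E5", i.e. 日
```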

I hope this gives you some pointers.

Peter

PS:
> 日本の食べ物はとてもきれいでおいしですね?
Should be:
日本の食べ物はとてもきれいでおいしいですね?
From: Pascal Bourguignon
Subject: Re: Trouble decoding URL encoded Japanese characters
Date: 
Message-ID: <87irjc13d7.fsf@thalassa.informatimago.com>
·················@gmail.com writes:

> I have what is hopefully an easy question to answer.  I'm trying to
> convert some UTF-8 URL-encoded strings to UTF-8 strings.  The URL
> strings are in Japanese and look like this:
>
> "%E6%97%A5%E6%9C%AC%E3%81%AE%E9%A3%9F%E3%81%B9%E7%89%A9%E3%81%AF%E3%81%A8%E3%81%A6%E3%82%82%E3%81%8D%E3%82%8C%E3%81%84%E3%81%A7%E3%81%8A%E3%81%84%E3%81%97%E3%81%A7%E3%81%99%E3%81%AD%EF%BC%9F"
>
> After the conversion it should look like this:
>
> 日本の食べ物はとてもきれいでおいしですね?

You have two levels of encoding/decoding here.
First decode the ASCII %HH string to a byte sequence,
then UTF-8-decode the byte sequence to a character string.

sunwukong showed how the UTF-8 decoding can be done.  It is also
implemented in SBCL by sb-ext:octets-to-string and in CLISP by
ext:convert-string-from-bytes, which libraries such as arnesi
encapsulate under a common API.
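For example, on SBCL the UTF-8 step might look like this (a sketch; the
three octets below are the UTF-8 encoding of 日, U+65E5):

```lisp
;; Sketch: UTF-8-decoding a byte sequence with the SBCL built-in.
;; The three octets are the UTF-8 encoding of 日 (U+65E5).
#+sbcl
(sb-ext:octets-to-string
 (coerce '(230 151 165) '(vector (unsigned-byte 8)))
 :external-format :utf-8)
;; => "日"
```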


Decoding the %HH escapes can be done with:

(defparameter *ascii* " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")

(defun char-to-ascii-byte (char)
  (let ((p (position char *ascii* :test (function char=))))
    (if p (+ 32 p) (error "Not an ASCII character: ~C" char))))
                                                                


(defun uri-decode (uri-string)
  (loop
     :with buffer = (make-array (length uri-string)
                                :fill-pointer 0 :element-type '(unsigned-byte 8))
     :with i = 0
     :while (< i (length uri-string))
     :do (if (char= #\% (aref uri-string i))
             (if (< (- (length uri-string) 3) i)
                 (error "% appears too close to the end of the URI: ~S"
                        uri-string)
                 (progn
                   (vector-push (parse-integer uri-string :radix 16
                                               :start (+ 1 i) :end (+ 3 i)
                                               :junk-allowed nil)
                                buffer)
                   (incf i 3)))
             (progn
               (vector-push (char-to-ascii-byte (aref uri-string i)) buffer)
               (incf i)))
     :finally (return buffer))); or (return (copy-seq buffer))


(uri-decode  "%E6%97%A5%E6%9C%AC%E3%81%AE%E9%A3%9F%E3%81%B9%E7%89%A9%E3%81%AF%E3%81%A8%E3%81%A6%E3%82%82%E3%81%8D%E3%82%8C%E3%81%84%E3%81%A7%E3%81%8A%E3%81%84%E3%81%97%E3%81%A7%E3%81%99%E3%81%AD%EF%BC%9F")
-->
#(230 151 165 230 156 172 227 129 174 233 163 159 227 129 185 231 137 169 227
  129 175 227 129 168 227 129 166 227 130 130 227 129 141 227 130 140 227 129
  132 227 129 167 227 129 138 227 129 132 227 129 151 227 129 167 227 129 153
  227 129 173 239 188 159)

#+clisp (ext:convert-string-from-bytes (uri-decode  "%E6%97%A5%E6%9C%AC%E3%81%AE%E9%A3%9F%E3%81%B9%E7%89%A9%E3%81%AF%E3%81%A8%E3%81%A6%E3%82%82%E3%81%8D%E3%82%8C%E3%81%84%E3%81%A7%E3%81%8A%E3%81%84%E3%81%97%E3%81%A7%E3%81%99%E3%81%AD%EF%BC%9F") charset:utf-8)
-->
"日本の食べ物はとてもきれいでおいしですね?"



> Here's my sample code:
>
> (defun unicode-test ()
>   (with-open-file (out (merge-pathnames #p"unicode.txt"
>                                         (util:source-file-directory))
>                        :direction :output
>                        :if-exists :supersede
>                        :if-does-not-exist :create
>                        :external-format :UTF-8)
>     (print (araneida:urlstring-unescape
>             "%e6%97%a5%e6%9c%ac%e3%81%ae%e9%a3%9f%e3%81%b9%e7%89%a9%e3%81%af%e3%81%a8%e3%81%a6%e3%82%82%e3%81%8d%e3%82%8c%e3%81%84%e3%81%a7%e3%81%8a%e3%81%84%e3%81%97%e3%81%a7%e3%81%99%e3%81%ad%ef%bc%9f")
>            out)
>     nil))

You seem to be printing a string.  This is incorrect.  The mere %-
_unescaping_ of a URI can only return a byte vector, not characters.

Perhaps araneida is assuming that it can store bytes into a string
using code-char (and recover the bytes with char-code).  This is
implementation-dependent: it might work on most implementations, but
you lose the "type" information and can easily make errors, as you
did here.

So the first thing would be to convert back the string into a byte vector:

(map 'vector (function char-code) (araneida:urlstring-unescape "%e6%97%a5%e6%9c%ac%e3%81%ae%e9%a3%9f%e3%81%b9%e7%89%a9%e3%81%af%e3%81%a8%e3%81%a6%e3%82%82%e3%81%8d%e3%82%8c%e3%81%84%e3%81%a7%e3%81%8a%e3%81%84%e3%81%97%e3%81%a7%e3%81%99%e3%81%ad%ef%bc%9f"))

Then you can convert this byte vector into a string.  Here we assume
the byte vector is encoded in UTF-8 (as we have since the beginning of
this post), but remember that this is not what all browsers do: my
browser converts a URL such as http://localhost/été to
http://localhost/%E9t%E9, which is the same string encoded in
ISO-8859-1 and then uri-escaped.  Under that assumption, you convert
the byte sequence to characters with:

#+clisp (ext:convert-string-from-bytes (map 'vector (function char-code) (araneida:urlstring-unescape "%e6%97%a5%e6%9c%ac%e3%81%ae%e9%a3%9f%e3%81%b9%e7%89%a9%e3%81%af%e3%81%a8%e3%81%a6%e3%82%82%e3%81%8d%e3%82%8c%e3%81%84%e3%81%a7%e3%81%8a%e3%81%84%e3%81%97%e3%81%a7%e3%81%99%e3%81%ad%ef%bc%9f")) charset:utf-8)

and you can then write this string to a file, where it will be encoded
into UTF-8 if you ask for it (but you could encode it into UTF-16,
ISO-2022-JP, or anything else, as long as all the characters in the
string are in the character set of the encoding used).
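For instance, a sketch of writing the decoded string out UTF-8-encoded
(the file name here is arbitrary):

```lisp
;; Sketch: writing a Lisp string to a file, UTF-8-encoded.
;; Works on SBCL and CLISP with Unicode support; the path is arbitrary.
(with-open-file (out "unicode.txt"
                     :direction :output
                     :if-exists :supersede
                     :if-does-not-exist :create
                     :external-format :utf-8)
  (write-string "日本の食べ物はとてもきれいでおいしですね?" out))
```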



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"A TRUE Klingon warrior does not comment his code!"
From: ·················@gmail.com
Subject: Re: Trouble decoding URL encoded Japanese characters
Date: 
Message-ID: <1159203815.111312.224280@b28g2000cwb.googlegroups.com>
Thank you both.  It makes a bit more sense now.  I'll give those
solutions a try.

Anthony