From: franks
Subject: Umlaut encoding
Date: 
Message-ID: <1170282289.814941.244820@v33g2000cwv.googlegroups.com>
I'm receiving an xml-page from the internet with the client of aserve.
The heading is <?xml version='1.0' encoding='UTF-8'?>.
The page contains the string "München", which has umlaut u in it.
If I save the page as a text file and look at it with a simple text
editor, it looks ok.
Now I parse the page with s-xml, and the interesting string prints as
"München",
i.e. the umlaut u becomes tilde A plus one quarter. In the inspector
it looks the same. I tried to bind *DEFAULT-EXTERNAL-FORMAT* for calling
parse-xml-string by

(let ((doc)
      (stream::*DEFAULT-EXTERNAL-FORMAT* (list :utf-8 :EOL-STYLE :CRLF)))
  (setf doc (s-xml:parse-xml-string xml-page :output-type :xml-struct)))

This doesn't change anything. The behaviour is the same in ACL and LW
on mswin.
I'm afraid I don't understand anything about encodings yet.
What can I do? Can someone help me, please?
Thanks, Frank
From: Alex Mizrahi
Subject: Re: Umlaut encoding
Date: 
Message-ID: <45c12072$0$49201$14726298@news.sunsite.dk>
(message (Hello 'franks)
(you :wrote  :on '(31 Jan 2007 14:24:49 -0800))
(

 f> I'm receiving an xml-page from the internet with the client of aserve.
 f> The heading is <?xml version='1.0' encoding='UTF-8'?>.
 f> The page contains the string "München", which has umlaut u in it.
 f> If I save the page as a text file and look at it with a simple text
 f> editor, it looks ok.
 f> Now I parse the page with s-xml, and the interesting string prints as
 f> "München",
 f> i.e. the umlaut u becomes tilde A plus one quarter.

that's what utf-8 bytes look like when they are decoded as a single-byte
encoding (e.g. latin-1): the two bytes of the umlaut show up as two characters.
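
to see where the two characters come from, here is a minimal sketch in plain common lisp. the function names are made up for illustration, and it assumes that code-char and char-code follow unicode/latin-1 for codes below 256, which is implementation-dependent.

```lisp
;; Sketch only: UTF-8-OCTETS and MISREAD-AS-LATIN-1 are hypothetical helpers,
;; not part of s-xml or aserve. Assumes CHAR-CODE/CODE-CHAR follow Unicode.

(defun utf-8-octets (string)
  "Encode STRING as a list of UTF-8 octets (code points below #x10000 only)."
  (loop for ch across string
        for code = (char-code ch)
        append (cond ((< code #x80)
                      (list code))
                     ((< code #x800)
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (list (logior #xE0 (ash code -12))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F))))
                     (t (error "code point ~X is outside this sketch" code)))))

(defun misread-as-latin-1 (string)
  "Take each UTF-8 octet of STRING as one latin-1 character."
  (map 'string #'code-char (utf-8-octets string)))

;; #xFC is u-umlaut; built with CODE-CHAR so the source file stays ASCII.
(defparameter *city* (format nil "M~Cnchen" (code-char #xFC)))

(utf-8-octets *city*)       ; => (77 195 188 110 99 104 101 110)
(misread-as-latin-1 *city*) ; 8 characters: M, A-tilde, one-quarter, n, c, h, e, n
```

the umlaut is octets 195 and 188 in utf-8, and those are exactly the latin-1 codes of tilde A and one quarter.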

 f> In the inspector it looks the same. I tried to bind
 f> *DEFAULT-EXTERNAL-FORMAT* for calling parse-xml-string by

it's not the xml-parser's problem.

you read UTF-8 encoded data from a stream, so you should specify that
encoding when you read it; then the implementation will decode UTF-8 into its
internal representation. or you can read the data as it is and transcode it
afterwards.
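
the "read, then transcode" route could look roughly like the sketch below. it is portable common lisp with no libraries; the function names are hypothetical, there is no validation of the input, and it assumes the string was mis-decoded as plain latin-1 (so every character code is one original octet). if the data went through some other single-byte encoding, the octets may not round-trip.

```lisp
;; Sketch only: these helpers are hypothetical, not an API of ACL, LW or s-xml.
;; Assumes CHAR-CODE/CODE-CHAR follow Unicode and the input was read as latin-1.

(defun latin-1-octets (string)
  "Recover the original octets from a string that was decoded as latin-1."
  (map 'list #'char-code string))

(defun decode-utf-8 (octets)
  "Decode a list of UTF-8 octets into a string.
No validation; handles 1- to 3-octet sequences only."
  (with-output-to-string (out)
    (loop while octets
          do (let ((b (pop octets)))
               (flet ((cont ()
                        ;; payload bits of the next continuation octet
                        (logand (pop octets) #x3F)))
                 (write-char
                  (code-char
                   (cond ((< b #x80) b)
                         ((< b #xE0)
                          (logior (ash (logand b #x1F) 6) (cont)))
                         (t
                          (let* ((c1 (cont))
                                 (c2 (cont)))
                            (logior (ash (logand b #x0F) 12) (ash c1 6) c2)))))
                  out))))))

(defun repair-mis-decoded (string)
  "Re-interpret a latin-1-decoded STRING as the UTF-8 it really was."
  (decode-utf-8 (latin-1-octets string)))

;; The 8-character string from the question, built without non-ASCII source text.
(defparameter *garbled*
  (coerce (list #\M (code-char #xC3) (code-char #xBC) #\n #\c #\h #\e #\n)
          'string))

(length (repair-mis-decoded *garbled*))              ; => 7
(char-code (char (repair-mis-decoded *garbled*) 1))  ; => 252, i.e. u-umlaut
```

the repaired string can then be handed to the parser. telling the http client the right encoding when the page is read is still the cleaner fix, since then nothing has to be repaired afterwards.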

the xml-parser could handle that if it operated on raw bytes. but it doesn't --
you give it a lisp string, which is supposed to be already decoded. a lisp
string is not just bytes; it can be a unicode string or whatever.

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"People who lust for the Feel of keys on their fingertips (c) Inity")