converting from say UTF-8 to LATIN-1

From: Sacha
Subject: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 13:46:42 +0000
Message-ID: <6dIAg.2881$xo5.127101@phobos.telenet-ops.be>

Hello all,

I have this UTF-8 encoded file which I need to parse.
I've been able to load the file as UTF-8, and put it in a UTF-8 string, but 
that doesn't help with the parsing.

Using LispWorks, it seems that the editor uses LATIN-1 encoding.
So that means the source code and strings defined within are Latin-1 
encoded.

In order to do proper parsing I believe I'll need to convert that UTF-8 file 
to LATIN-1.

I've been googling but didn't find anything that would help.

Thanks in advance

Sacha

From: Tayssir John Gabbour
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 13:51:03 +0000
Message-ID: <1154699463.558771.211020@m79g2000cwm.googlegroups.com>

Sacha wrote:
> I have this UTF-8 encoded file which I need to parse.
> I've been able to load the file as UTF-8, and put it in a UTF-8 string, but
> that doesn't help with the parsing.
>
> Using LispWorks, it seems that the editor uses LATIN-1 encoding.
> So that means the source code and strings defined within are Latin-1
> encoded.

Would Flexi-streams do the job?
http://weitz.de/flexi-streams/

Tayssir

From: Sacha
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 17:37:46 +0000
Message-ID: <KBLAg.3197$zO4.104551@phobos.telenet-ops.be>

"Tayssir John Gabbour" <···········@yahoo.com> wrote in message 
·····························@m79g2000cwm.googlegroups.com...
> Sacha wrote:
>> I have this UTF-8 encoded file which I need to parse.
>> I've been able to load the file as UTF-8, and put it in a UTF-8 string,
>> but that doesn't help with the parsing.
>>
>> Using LispWorks, it seems that the editor uses LATIN-1 encoding.
>> So that means the source code and strings defined within are Latin-1
>> encoded.
>
> Would Flexi-streams do the job?
> http://weitz.de/flexi-streams/
>
> Tayssir
>


I'm currently trying to figure out how to make this work...
In the meantime I'll use this simple function:

(defun ->latin-1 (string)
  (format t "~%Converting to Latin-1...")
  (let ((latin-1-string (make-string (length string)))
        (unrecognized-chars nil))
    (loop for char across string
          for index from 0
          do (setf (aref latin-1-string index)
                   (let ((new-char (find-external-char (char-code char)
                                                       :latin-1)))
                     (if new-char
                         new-char
                       (progn
                         (push char unrecognized-chars)
                         #\@)))))
    (values latin-1-string
            unrecognized-chars)))


Thank you for your time
Sacha
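
For readers without LispWorks, the same idea can be sketched in portable Common Lisp. This is only an illustration under stated assumptions: it assumes CHAR-CODE returns the Unicode code point (implementation-dependent, though common), and it relies on the fact that Latin-1 covers exactly code points 0 to 255. The name RESTRICT-TO-LATIN-1 and the default placeholder are hypothetical and do not come from the thread.

```lisp
;; Sketch: keep characters that fit in Latin-1 and replace the rest.
;; Assumption: CHAR-CODE yields Unicode code points, so "fits in Latin-1"
;; is the same as "code point below 256".
(defun restrict-to-latin-1 (string &optional (placeholder #\?))
  "Return a copy of STRING with non-Latin-1 characters replaced by PLACEHOLDER.
The second value lists the distinct characters that were replaced."
  (let ((unrecognized '()))
    (values
     (map 'string
          (lambda (ch)
            (cond ((< (char-code ch) 256) ch)
                  (t (pushnew ch unrecognized)
                     placeholder)))
          string)
     (nreverse unrecognized))))

;; Usage: the euro sign (U+20AC) has no Latin-1 representation.
(restrict-to-latin-1
 (coerce (list #\a (code-char 233) (code-char #x20AC)) 'string))
```

Like the function above, this works on characters rather than octets, so it does not change how a string compares against literals in the source.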

From: Carl Taylor
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 16:43:57 +0000
Message-ID: <hPKAg.214214$mF2.107789@bgtnsc04-news.ops.worldnet.att.net>

Sacha wrote:
> Hello all,
>
> I have this UTF-8 encoded file which I need to parse.
> I've been able to load the file as UTF-8, and put it in a UTF-8
> string, but that doesn't help with the parsing.
>
> Using LispWorks, it seems that the editor uses LATIN-1 encoding.
> So that means the source code and strings defined within are Latin-1
> encoded.
>
> In order to do proper parsing I believe I'll need to convert that
> UTF-8 file to LATIN-1.

Just for the hell of it try this:

( ....
   (with-open-file (in-stream file-path :direction :input
                              :external-format :utf-8)
       (other-stuff ....... )))))

I don't have any utf-8 files, but when I try to read a latin-1 file in 
LispWorks with the above code I get the following break, so it seems LW 
knows about different formats.

Error: External format (:UTF-8 :EOL-STYLE :CRLF) produces characters of type 
SIMPLE-CHAR, which is not a subtype of the specified element-type BASE-CHAR.

Carl Taylor
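
A minimal sketch of the approach suggested above, with an explicit :element-type of CHARACTER so that the stream is allowed to produce characters outside BASE-CHAR, which is what the error message complains about. The function names and the demo file name are hypothetical. The :utf-8 external-format keyword is implementation-defined; LispWorks accepts it according to this thread, and several other implementations do too, but that is an assumption to check locally.

```lisp
;; Sketch: round-trip a string through a UTF-8 file.
;; Assumption: the implementation accepts :UTF-8 as an external format.
(defun write-string-to-file (string path &key (external-format :utf-8))
  "Write STRING to PATH using EXTERNAL-FORMAT, replacing any existing file."
  (with-open-file (out path :direction :output
                            :element-type 'character
                            :external-format external-format
                            :if-exists :supersede)
    (write-string string out))
  path)

(defun read-file-into-string (path &key (external-format :utf-8))
  "Decode the file at PATH using EXTERNAL-FORMAT and return it as a string."
  (with-open-file (in path :direction :input
                           :element-type 'character
                           :external-format external-format)
    (with-output-to-string (out)
      (loop for ch = (read-char in nil nil)
            while ch
            do (write-char ch out)))))

;; Usage: "caf" followed by e-acute (code point 233), written and read back.
(let ((path "utf8-roundtrip-demo.txt")   ; hypothetical file name
      (text (concatenate 'string "caf" (string (code-char 233)))))
  (write-string-to-file text path)
  (prog1 (string= text (read-file-into-string path))
    (delete-file path)))
```

The string that comes back holds characters, not octets, so it can be compared directly with strings built in the source.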

From: Sacha
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 17:41:49 +0000
Message-ID: <xFLAg.3203$KO4.115826@phobos.telenet-ops.be>

"Carl Taylor" <··········@att.net> wrote in message 
····························@bgtnsc04-news.ops.worldnet.att.net...
> Sacha wrote:
>> Hello all,
>> In order to do proper parsing I believe I'll need to convert that
>> UTF-8 file to LATIN-1.
>
>
> Just for the hell of it try this:
>
> ( ....
>   (with-open-file (in-stream file-path :direction :input
>                              :external-format :utf-8)
>       (other-stuff ....... )))))
>

That's what I'm already doing, but when I read that into a string, the string
is still UTF-8, while the parsing code I'm writing within the LispWorks editor
is using Latin-1. Comparing string literals across character encodings doesn't
work, which is why I need to convert the string.

Sacha

From: Christophe Rhodes
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Fri, 04 Aug 2006 21:36:26 +0000
Message-ID: <sq64h8chet.fsf@cam.ac.uk>

"Sacha" <··@address.spam> writes:

> That's what I'm already doing, but when I read that into a string,
> the string is still UTF-8, while the parsing code I'm writing within
> the LispWorks editor is using Latin-1.

I'm no lispworks expert, but it would surprise me very much to learn
that this was the case; I would expect that all lispworks strings are
vectors of characters, not sequences of octets.

> Comparing string literals across character encodings doesn't work,
> which is why I need to convert the string.

Again, this doesn't sound right to me.  Are you sure that this is what
is going on?

Christophe
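
The point about strings being vectors of characters can be illustrated with a small sketch. It encodes the same one-character string to octets in two ways: the string itself is identical in both cases, and only the octet sequences differ. The two function names are hypothetical, the UTF-8 encoder is hand-rolled for illustration rather than taken from any library, and the code assumes CHAR-CODE returns Unicode code points.

```lisp
;; Sketch: one string, two octet representations.
;; Assumption: CHAR-CODE returns the Unicode code point of a character.
(defun latin-1-octets (string)
  "Return the Latin-1 octets of STRING as a list; signal an error if a
character does not fit in Latin-1."
  (map 'list
       (lambda (ch)
         (let ((code (char-code ch)))
           (if (< code 256)
               code
               (error "~S has no Latin-1 representation." ch))))
       string))

(defun utf-8-octets (string)
  "Return the UTF-8 octets of STRING as a list."
  (loop for ch across string
        for code = (char-code ch)
        append (cond ((< code #x80)
                      (list code))
                     ((< code #x800)
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (list (logior #xE0 (ash code -12))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F))))
                     (t
                      (list (logior #xF0 (ash code -18))
                            (logior #x80 (logand (ash code -12) #x3F))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F)))))))

;; Usage: e-acute is code point 233.
(let ((e-acute (string (code-char 233))))
  (list (latin-1-octets e-acute)    ; => (233)
        (utf-8-octets e-acute)))    ; => (195 169)
```

Once a file has been decoded with the right external format, the resulting string no longer carries an encoding, which is why comparing it with string literals works.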

From: Sacha
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Sat, 05 Aug 2006 02:02:32 +0000
Message-ID: <Y_SAg.3992$MO4.117876@phobos.telenet-ops.be>

"Christophe Rhodes" <·····@cam.ac.uk> wrote in message 
···················@cam.ac.uk...
> "Sacha" <··@address.spam> writes:
>
>> That's what I'm already doing, but when I read that into a string,
>> the string is still UTF-8, while the parsing code I'm writing within
>> the LispWorks editor is using Latin-1.
>
> I'm no lispworks expert, but it would surprise me very much to learn
> that this was the case; I would expect that all lispworks strings are
> vectors of characters, not sequences of octets.
>


All right, I made some more tests, and you're totally right; I only had part
of the picture in mind.
My first tests were wrong, I guess...

Thanks a lot !

Sacha

PS: sorry for replying to your email address =/

From: Carl Taylor
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Sat, 05 Aug 2006 02:49:03 +0000
Message-ID: <zGTAg.550760$Fs1.406652@bgtnsc05-news.ops.worldnet.att.net>

"Sacha" <··@address.spam> wrote in message 
··························@phobos.telenet-ops.be...
>
> "Christophe Rhodes" <·····@cam.ac.uk> wrote in message ···················@cam.ac.uk...
>> "Sacha" <··@address.spam> writes:
>>
>>> That's what I'm already doing, but when I read that into a string,
>>> the string is still UTF-8, while the parsing code I'm writing within
>>> the LispWorks editor is using Latin-1.
>>
>> I'm no lispworks expert, but it would surprise me very much to learn
>> that this was the case; I would expect that all lispworks strings are
>> vectors of characters, not sequences of octets.
>>
>
>
> All right, I made some more tests, and you're totally right; I only had part
> of the picture in mind.
> My first tests were wrong, I guess...


FWIW here is some testing I did in LW out of curiosity. It seems the difference between 
Latin-1 and UTF-8 is one of character type, viz. simple-char vs. base-char.  In the code I 
copy a Latin-1 file to a UTF-8 version, then test the format type of the new file. 
Specifying :element-type causes the :external-format to be correctly propagated.  Did you 
specify both :element-type and :external-format in your code?

Carl Taylor

CL-USER 1 >
(defun copy-latin-1-to-utf-8 ()
  (let ((i-path #P"C:/Documents and Settings/Carl/my documents/lisp/BP/06-2006.txt")
        (o-path #P"C:/Documents and Settings/Carl/my documents/lisp/utf8-test.txt"))
    (with-open-file (i-stream i-path :direction :input
                                     :element-type 'base-char :external-format :latin-1)
      (with-open-file (o-stream o-path :direction :output :external-format :utf-8
                                       :element-type 'simple-char :if-exists :supersede)
        (loop for x-char = (read-char i-stream nil nil nil)
              while x-char
                 do (write-char x-char o-stream)
              finally
                return (format t "~%~S~%~S~2%"
                                 (stream-external-format i-stream)
                                 (stream-external-format o-stream)))))))
COPY-LATIN-1-TO-UTF-8

CL-USER 2 >
(compile *)
COPY-LATIN-1-TO-UTF-8
NIL
NIL


CL-USER 3 >
(copy-latin-1-to-utf-8)

(:LATIN-1 :EOL-STYLE :CRLF)
(:UTF-8 :EOL-STYLE :CRLF)

NIL


CL-USER 4 >
(with-open-file (i-stream
                 #P"C:/Documents and Settings/Carl/my documents/lisp/utf8-test.txt"
                 :direction :input :external-format :utf-8 :element-type 'simple-char)
  (stream-external-format i-stream))
(:UTF-8 :EOL-STYLE :CRLF)
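
As far as I can tell, STREAM-EXTERNAL-FORMAT reports the format a stream was opened with, so the check in CL-USER 4 may simply echo the :external-format argument rather than inspect the file. A complementary check is to read the raw octets back. The sketch below does that; FILE-OCTETS and the demo file name are hypothetical, and the :utf-8 and :latin-1 keywords are assumed to be accepted by the implementation.

```lisp
;; Sketch: look at the octets actually stored in a file.
(defun file-octets (path)
  "Return the contents of the file at PATH as a vector of octets."
  (with-open-file (in path :direction :input
                           :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

;; Usage: write e-acute (code point 233) in two encodings and compare.
;; Assumption: :UTF-8 and :LATIN-1 are valid external formats here.
(let ((path "file-octets-demo.txt"))   ; hypothetical file name
  (flet ((octets-when-written-as (external-format)
           (with-open-file (out path :direction :output
                                     :element-type 'character
                                     :external-format external-format
                                     :if-exists :supersede)
             (write-char (code-char 233) out))
           (file-octets path)))
    (prog1 (list (octets-when-written-as :utf-8)
                 (octets-when-written-as :latin-1))
      (delete-file path))))
```

If the copy really was transcoded, a non-ASCII character should occupy two or more octets in the UTF-8 file and a single octet in the Latin-1 file.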

From: Sacha
Subject: Re: converting from say UTF-8 to LATIN-1
Date: Sat, 05 Aug 2006 11:59:26 +0000
Message-ID: <yK%Ag.4763$IM4.52276@phobos.telenet-ops.be>

"Carl Taylor" <··········@att.net> wrote in message 
····························@bgtnsc05-news.ops.worldnet.att.net...
> FWIW here is some testing I did in LW out of curiosity. It seems the 
> difference between Latin-1 and UTF-8 is one of character type, viz. 
> simple-char vs. base-char.  In the code I copy a Latin-1 file to a UTF-8 
> version, then test the format type of the new file. Specifying 
> :element-type causes the :external-format to be correctly propagated.  Did 
> you specify both :element-type and :external-format in your code?


Yep, it's working nicely and easily now; I was mistaken due to some bad
testing.
Thanks for your time.

Sacha