Hello all,
I have this UTF-8 encoded file which I need to parse.
I've been able to load the file as UTF-8, and put it in a UTF-8 string, but
that doesn't help with the parsing.
Using LispWorks, it seems that the editor uses Latin-1 encoding.
So that means the source code and strings defined within are Latin-1
encoded.
In order to do proper parsing I believe I'll need to convert that UTF-8 file
to Latin-1.
I've been googling but didn't find anything that would help.
Thanks in advance
Sacha
Sacha wrote:
> I have this UTF-8 encoded file which I need to parse.
> I've been able to load the file as UTF-8, and put it in a UTF-8 string, but
> that doesn't help with the parsing.
>
> Using LispWorks, it seems that the editor uses Latin-1 encoding.
> So that means the source code and strings defined within are Latin-1
> encoded.
Would Flexi-streams do the job?
http://weitz.de/flexi-streams/
Tayssir
"Tayssir John Gabbour" <···········@yahoo.com> wrote in message
·····························@m79g2000cwm.googlegroups.com...
> Sacha wrote:
>> I have this UTF-8 encoded file which I need to parse.
>> I've been able to load the file as UTF-8, and put it in a UTF-8 string,
>> but that doesn't help with the parsing.
>>
>> Using LispWorks, it seems that the editor uses Latin-1 encoding.
>> So that means the source code and strings defined within are Latin-1
>> encoded.
>
> Would Flexi-streams do the job?
> http://weitz.de/flexi-streams/
>
> Tayssir
>
I'm currently trying to figure out how to make this work...
In the meantime I'll use this simple function:
(defun ->latin-1 (string)
  (format t "~%Converting to Latin-1...")
  (let ((latin-1-string (make-string (length string)))
        (unrecognized-chars nil))
    (loop for char across string
          for index from 0
          do (setf (aref latin-1-string index)
                   (let ((new-char (find-external-char (char-code char)
                                                       :latin-1)))
                     (if new-char
                         new-char
                         (progn
                           ;; No Latin-1 equivalent: remember the character
                           ;; and put a placeholder in its position.
                           (push char unrecognized-chars)
                           #\@)))))
    (values latin-1-string
            unrecognized-chars)))
Thank you for your time
Sacha
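For readers without LispWorks, the same idea can be sketched in portable Common Lisp. This is only an illustration, not the function from the post: the name restrict-to-latin-1 and the #\? placeholder are made up here, and it assumes that char-code returns Unicode code points (true of most current implementations, but not guaranteed by the standard), so that the Latin-1 repertoire is exactly the codes below 256.

```lisp
;; Hypothetical helper, not part of LispWorks or any library.
;; Assumption: CHAR-CODE yields Unicode code points, so Latin-1 = codes 0..255.
(defun restrict-to-latin-1 (string &optional (placeholder #\?))
  "Return a copy of STRING with every character outside Latin-1 replaced by
PLACEHOLDER, plus a list of the characters that were replaced."
  (let ((unrecognized '()))
    (values (map 'string
                 (lambda (char)
                   (if (< (char-code char) 256)
                       char
                       (progn
                         (push char unrecognized)
                         placeholder)))
                 string)
            (nreverse unrecognized))))
```

Calling it on a string that contains a euro sign (code 8364) should give back the string with #\? in that position and a one-element list holding the euro sign as the second value.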
Sacha wrote:
> Hello all,
>
> I have this UTF-8 encoded file which I need to parse.
> I've been able to load the file as UTF-8, and put it in a UTF-8
> string, but that doesn't help with the parsing.
>
> Using LispWorks, it seems that the editor uses Latin-1 encoding.
> So that means the source code and strings defined within are Latin-1
> encoded.
>
> In order to do proper parsing I believe I'll need to convert that
> UTF-8 file to Latin-1.
Just for the hell of it try this:
( ....
  (with-open-file (in-stream file-path :direction :input
                                       :external-format :utf-8)
    (other-stuff ....... )))))
I don't have any utf-8 files, but when I try to read a latin-1 file in
LispWorks with the above code I get the following break, so it seems LW
knows about different formats.
Error: External format (:UTF-8 :EOL-STYLE :CRLF) produces characters of type
SIMPLE-CHAR, which is not a subtype of the specified element-type BASE-CHAR.
Carl Taylor
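The error above complains about the element type rather than the external format, so one way to avoid it is to ask for full characters when opening the file. The following is a minimal sketch, not Carl's code: read-utf-8-file is a name invented for illustration, and it assumes the implementation accepts :utf-8 as an :external-format designator (external format names are implementation-defined).

```lisp
;; Hypothetical helper. Assumption: the implementation accepts :UTF-8 as an
;; :EXTERNAL-FORMAT designator (the standard leaves these names open).
(defun read-utf-8-file (path)
  "Return the contents of the UTF-8 encoded file PATH as one string."
  (with-open-file (in path :direction :input
                           :element-type 'character
                           :external-format :utf-8)
    ;; Ask for a string stream that can hold any character, so nothing is
    ;; restricted to BASE-CHAR on the way through.
    (with-output-to-string (out nil :element-type 'character)
      (loop for char = (read-char in nil nil)
            while char
            do (write-char char out)))))
```

The result is an ordinary Lisp string of characters; the UTF-8 encoding is dealt with at the stream boundary and is no longer a property of the string.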
"Carl Taylor" <··········@att.net> wrote in message
····························@bgtnsc04-news.ops.worldnet.att.net...
> Sacha wrote:
>> Hello all,
>> In order to do proper parsing I believe I'll need to convert that
>> UTF-8 file to Latin-1.
>
>
> Just for the hell of it try this:
>
> ( ....
> (with-open-file (in-stream file-path :direction :input :external-format
> :utf-8)
> (other-stuff ....... )))))
>
That's what I'm already doing, but when I read that into a string, the
string still is UTF-8, while the parsing code I'm writing within the
LispWorks editor is using Latin-1. Comparing string literals across
character encodings doesn't work, that's why I need to convert the string.
Sacha
From: Christophe Rhodes
Subject: Re: converting from say UTF-8 to LATIN-1
Date:
Message-ID: <sq64h8chet.fsf@cam.ac.uk>
"Sacha" <··@address.spam> writes:
> That's what I'm already doing, but when I read that into a string,
> the string still is UTF-8, while the parsing code I'm writing within
> the LispWorks editor is using Latin-1.
I'm no lispworks expert, but it would surprise me very much to learn
that this was the case; I would expect that all lispworks strings are
vectors of characters, not sequences of octets.
> Comparing string literals across character encodings doesn't work,
> that's why I need to convert the string.
Again, this doesn't sound right to me. Are you sure that this is what
is going on?
Christophe
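To make the "vectors of characters, not sequences of octets" point concrete, here is a small sketch in portable Common Lisp. It is an illustration only: decode-utf-8 is a hand-written toy decoder (no validation of malformed input), not a LispWorks or FLEXI-STREAMS function, and it assumes code-char maps Unicode code points to characters, as most current implementations do.

```lisp
;; Toy decoder for illustration; it does not validate its input.
;; Assumption: CODE-CHAR maps Unicode code points to characters.
(defun decode-utf-8 (octets)
  "Decode a vector of well-formed UTF-8 OCTETS into a string."
  (with-output-to-string (out nil :element-type 'character)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (let* ((lead (aref octets i))
                      ;; Number of continuation octets that follow LEAD.
                      (extra (cond ((< lead #x80) 0)
                                   ((< lead #xE0) 1)
                                   ((< lead #xF0) 2)
                                   (t 3)))
                      ;; Payload bits carried by the lead octet itself.
                      (code (logand lead (aref #(#x7F #x1F #x0F #x07) extra))))
                 (dotimes (k extra)
                   (setf code (logior (ash code 6)
                                      (logand (aref octets (+ i k 1)) #x3F))))
                 (write-char (code-char code) out)
                 (incf i (1+ extra)))))))

;; Five octets on disk, four characters in the string. Once decoded, the
;; string compares equal to one built directly from characters.
(let ((from-file (decode-utf-8 #(99 97 102 195 169)))
      (literal (coerce (list #\c #\a #\f (code-char 233)) 'string)))
  (list (length from-file) (string= from-file literal)))
;; => (4 T)
```

In other words, the encoding matters only while octets are being turned into characters. After that, a string read from a UTF-8 file and a string literal typed in the editor are compared character by character.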
"Christophe Rhodes" <·····@cam.ac.uk> wrote in message
···················@cam.ac.uk...
> "Sacha" <··@address.spam> writes:
>
>> That's what I'm already doing, but when I read that into a string,
>> the string still is UTF-8, while the parsing code I'm writing within
>> the LispWorks editor is using Latin-1.
>
> I'm no lispworks expert, but it would surprise me very much to learn
> that this was the case; I would expect that all lispworks strings are
> vectors of characters, not sequences of octets.
>
All right, I made some more tests, and you're totally right, I only had part
of the picture in mind.
My first tests were wrong, I guess...
Thanks a lot!
Sacha
PS: hmm, sorry for replying to your email address =/
"Sacha" <··@address.spam> wrote in message
··························@phobos.telenet-ops.be...
> All right, I made some more tests, and you're totally right, I only had part
> of the picture in mind.
> My first tests were wrong, I guess...
FWIW here is some testing I did in LW out of curiosity. It seems the difference between
Latin-1 and UTF-8 is one of character type, viz. base-char vs. simple-char. In the code I
copy a Latin-1 file to a UTF-8 version, then test the format type of the new file.
Specifying :element-type causes the :external-format to be correctly propagated. Did you
specify both :element-type and :external-format in your code?
Carl Taylor
CL-USER 1 >
(defun copy-latin-1-to-utf-8 ()
  (let ((i-path #P"C:/Documents and Settings/Carl/my documents/lisp/BP/06-2006.txt")
        (o-path #P"C:/Documents and Settings/Carl/my documents/lisp/utf8-test.txt"))
    (with-open-file (i-stream i-path :direction :input
                              :element-type 'base-char :external-format :latin-1)
      (with-open-file (o-stream o-path :direction :output :external-format :utf-8
                                :element-type 'simple-char :if-exists :supersede)
        ;; Copy character by character, then report both external formats.
        (loop for x-char = (read-char i-stream nil nil nil)
              while x-char
              do (write-char x-char o-stream)
              finally
              (return (format t "~%~S~%~S~2%"
                              (stream-external-format i-stream)
                              (stream-external-format o-stream))))))))
COPY-LATIN-1-TO-UTF-8

CL-USER 2 >
(compile *)
COPY-LATIN-1-TO-UTF-8
NIL
NIL

CL-USER 3 >
(copy-latin-1-to-utf-8)

(:LATIN-1 :EOL-STYLE :CRLF)
(:UTF-8 :EOL-STYLE :CRLF)

NIL

CL-USER 4 >
(with-open-file (i-stream
                 #P"C:/Documents and Settings/Carl/my documents/lisp/utf8-test.txt"
                 :direction :input :external-format :utf-8 :element-type 'simple-char)
  (stream-external-format i-stream))
(:UTF-8 :EOL-STYLE :CRLF)
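One way to see what the external format actually changes is to write the same character with both formats and then read the files back as raw octets. This is a hedged sketch rather than part of the test above: write-text and file-octets are names invented here, the file names are arbitrary, and it assumes the implementation accepts both :latin-1 and :utf-8 as :external-format designators.

```lisp
;; Hypothetical helpers. Assumption: :LATIN-1 and :UTF-8 are both accepted as
;; :EXTERNAL-FORMAT designators by the implementation in use.
(defun write-text (path text external-format)
  "Write TEXT to PATH using EXTERNAL-FORMAT, replacing any existing file."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :element-type 'character
                            :external-format external-format)
    (write-string text out)))

(defun file-octets (path)
  "Return the raw contents of PATH as a list of octets."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for octet = (read-byte in nil nil)
          while octet
          collect octet)))

;; The same one-character string (code 233, e with acute accent), two files.
(write-text "latin-1-sketch.txt" (string (code-char 233)) :latin-1)
(write-text "utf-8-sketch.txt" (string (code-char 233)) :utf-8)
(list (file-octets "latin-1-sketch.txt")
      (file-octets "utf-8-sketch.txt"))
;; => ((233) (195 169))
```

The character is the same in both cases. Only the octets on disk differ, which is the part the :external-format argument controls.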
"Carl Taylor" <··········@att.net> wrote in message
····························@bgtnsc05-news.ops.worldnet.att.net...
> FWIW here is some testing I did in LW out of curiosity. It seems the
> difference between Latin-1 and UTF-8 is one of character type, viz.
> base-char vs. simple-char. In the code I copy a Latin-1 file to a UTF-8
> version, then test the format type of the new file. Specifying
> :element-type causes the :external-format to be correctly propagated. Did
> you specify both :element-type and :external-format in your code?
>
> Carl Taylor
Yep, it's working nicely and easily now; I was mistaken due to some
bad testing.
Thanks for your time
Sacha