From: Mark Tarver
Subject: reading non-standard characters
Date: 
Message-ID: <1169045545.469151.218440@51g2000cwl.googlegroups.com>
I'm trying to read a .dbx file as a list of characters into Lisp. I get


invalid byte #x90 in CHARSET:CP1252 conversion

which signals, I guess, an attempt to read a character outside the
default character set.  Any good solutions?

Mark

From: Richard M Kreuter
Subject: Re: reading non-standard characters
Date: 
Message-ID: <871wlt3aik.fsf@progn.net>
"Mark Tarver" <··········@ukonline.co.uk> writes:

> I'm trying to read a .dbx file as a list of characters into Lisp. I get
>
>
> invalid byte #x90 in CHARSET:CP1252 conversion
>
> which signals, I guess, an attempt to read a character outside the
> default character set.  Any good solutions?

Three approaches are available, though the details are going to be
implementation-dependent:

* if the file is "really" a text file, find an external format that
  describes the file's encoding,

* if the file isn't a text file then if you know how the data is
  structured and if your implementation provides bivalent streams
  (streams on which you can call both read-byte and read-char), then
  try a bivalent stream,

* if the file isn't a text file and you don't know its structure,
  maybe try a binary stream and manually convert octet vectors to
  text.

You might find the flexi-streams library helpful, too:

http://weitz.de/flexi-streams/
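
The third approach can be sketched in portable Common Lisp. This is
only an illustration: the function names are made up for this post,
and the decoder assumes the implementation's character codes agree
with ASCII for the printable range (true of the common
implementations, but not promised by the standard).

```lisp
;; Hypothetical helpers, not part of any library.

(defun read-file-octets (path)
  "Return the contents of PATH as a vector of (unsigned-byte 8)."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun octets-to-printable-string (octets)
  "Turn printable ASCII octets into characters and all others into #\\?.
Assumes CODE-CHAR agrees with ASCII for codes 32 to 126."
  (map 'string
       (lambda (octet)
         (if (<= 32 octet 126)
             (code-char octet)
             #\?))
       octets))
```

Since no external format is involved when reading octets, a byte such
as #x90 cannot signal a conversion error; it simply shows up as "?".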

--
RmK
From: Mark Tarver
Subject: Re: reading non-standard characters
Date: 
Message-ID: <1169051758.258923.229000@l53g2000cwa.googlegroups.com>
Thanks; the idea was to write a Qi program to enable me to order my
sprawling inbox into a neat pile of directories.  Ergo to save me time.
As is often the case with computers, it's taking me longer to
implement the 'automated' method.

I think I'll coerce the file into text format via Word and then read
the text file into Qi.

Mark

From: Mark Tarver
Subject: DBX conversion was Re: reading non-standard characters
Date: 
Message-ID: <1169057130.851966.252040@l53g2000cwa.googlegroups.com>
No; that didn't work either.  I see there is a small industry for
converting DBX to text.  People selling the software for $20.

Any free DBX converters anyone?

Mark


From: D Herring
Subject: Re: DBX conversion was Re: reading non-standard characters
Date: 
Message-ID: <_7SdnUUuJd7T3C3YnZ2dnUVZ_hmtnZ2d@comcast.com>
Mark Tarver wrote:
> No; that didn't work either.  I see there is a small industry for
> converting DBX to text.  People selling the software for $20.
> 
> Any free DBX converters anyone?

http://www.pcworld.com/downloads/file/fid,23383-order,1-page,1-c,email/description.html
http://alioth.debian.org/projects/libpst/

See also ;-)
http://www.mozilla.com/en-US/thunderbird/
http://www.washington.edu/pine/

- Daniel
From: Pascal Bourguignon
Subject: Re: reading non-standard characters
Date: 
Message-ID: <87ps9dwutv.fsf@thalassa.informatimago.com>
"Mark Tarver" <··········@ukonline.co.uk> writes:

> I'm trying to read a .dbx file as a list of characters into Lisp. I get
>
>
> invalid byte #x90 in CHARSET:CP1252 conversion
>
> which signals, I guess, an attempt to read a character outside the
> default character set.  Any good solutions?

Open the .dbx file with an :external-format which specifies the right
encoding.  In what encoding is this file written?

That's assuming this is actually a TEXT file.
Perhaps it's a binary file?
Then open the .dbx file with the right :element-type '(unsigned-byte 8)
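
For the text-file case, here is a rough sketch.  Byte #x90 is
unassigned in CP1252, hence the error, whereas ISO-8859-1 assigns a
character to every octet.  The external-format names below are the
ones I believe clisp and sbcl accept; the names *LATIN-1* and
FILE-CHARACTERS are invented for the example, and other
implementations fall back to :default, which may still signal an
error.

```lisp
(defparameter *latin-1*
  #+clisp charset:iso-8859-1
  #+sbcl :latin-1
  #-(or clisp sbcl) :default
  "An external format meant to accept every octet (assumed names).")

(defun file-characters (path &optional (external-format *latin-1*))
  "Return the contents of PATH as a list of characters."
  (with-open-file (in path :external-format external-format)
    (loop for char = (read-char in nil nil)
          while char
          collect char)))
```

Reading this way avoids the conversion error, but the characters are
only meaningful if the file really is Latin-1 text.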



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

NOTE: The most fundamental particles in this product are held
together by a "gluing" force about which little is currently known
and whose adhesive power can therefore not be permanently
guaranteed.
From: Mark Tarver
Subject: Re: reading non-standard characters
Date: 
Message-ID: <1169049520.053584.45340@a75g2000cwd.googlegroups.com>
.dbx files are read by Outlook Express.  I'm not too sure on the
convention they use but MIME comes to mind.  I want to order my .dbx
files using Lisp.

I can convert .dbx to text but I still get read problems.  Using

(PRINT (STREAM-EXTERNAL-FORMAT In))

on the text, I get

#<ENCODING CHARSET:CP1252 :DOS>

Mark

From: Pascal Bourguignon
Subject: Re: reading non-standard characters
Date: 
Message-ID: <87ejptwm65.fsf@thalassa.informatimago.com>
"Mark Tarver" <··········@ukonline.co.uk> writes:
> Pascal Bourguignon wrote:
>> "Mark Tarver" <··········@ukonline.co.uk> writes:
>>
>> > I'm trying to read a .dbx file as a list of characters into Lisp. I get
>> >
>> >
>> > invalid byte #x90 in CHARSET:CP1252 conversion
>> >
>> > which signals, I guess, an attempt to read a character outside the
>> > default character set.  Any good solutions?
>>
>> Open the .dbx file with an :external-format which specifies the right
>> encoding.  In what encoding is this file written?
>>
>> That's assuming this is actually a TEXT file.
>> Perhaps it's a binary file?
>> Then open the .dbx file with the right :element-type '(unsigned-byte 8)
>
> .dbx files are read by Outlook Express.  

Of which the only thing I know is that it has some bug in encoding
replies to Mail.app, which makes my MacOSX users complain about
something I can do nothing about.

A dump of a .dbx file, or a description of what they contain would be
more useful.


> I'm not too sure on the convention they use but MIME comes to mind.
> I want to order my .dbx files using Lisp.

If they have MIME headers in the file, then you can use my
SAFE-TEXT-FILE-TO-STRING-LIST function to read the header, find the
Content-Type: header, parse its value and extract the charset name.

http://darcs.informatimago.com/darcs/public/lisp/common-lisp/file.lisp

SAFE-TEXT-FILE-TO-STRING-LIST (PATH &key (if-does-not-exist :error))
  "
DO:     - Read the file at PATH as a binary file,
        - Remove all null bytes (handle UTF-16, UCS-2, etc),
        - Split 'lines' on CR, CR+LF or LF,
        - Replace all bytes less than 32 or greater than 126 by #\?,
        - Convert the remaining bytes as ASCII codes into the CL standard set.
RETURN: The contents of the file as a list of base-string lines.
"


Then fetch the IANA character set registrations list at
http://www.iana.org/assignments/charset-reg/ and build a map between
these charset names and the names of the charsets in the CHARSET
package in clisp, or the keywords naming charsets in sbcl (etc. for
the other implementations).  Map the MIME charset to the native
charset token, which can be used as an external-format, and reopen
the file with that charset/external-format.
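
A minimal sketch of that mapping step, assuming sbcl-style keywords
as the native tokens.  The table is a tiny hand-written sample, not
the IANA list, and both function names are hypothetical:

```lisp
(defparameter *charset-keywords*
  '(("us-ascii"     . :ascii)
    ("iso-8859-1"   . :latin-1)
    ("utf-8"        . :utf-8)
    ("windows-1252" . :cp1252))
  "Sample map from MIME charset names to assumed sbcl external formats.")

(defun content-type-charset (value)
  "Return the charset parameter of a Content-Type VALUE, lowercased, or NIL."
  (let ((start (search "charset=" value :test #'char-equal)))
    (when start
      (let* ((from (+ start (length "charset=")))
             ;; The parameter ends at the next separator or at end of string.
             (to (or (position-if (lambda (c)
                                    (member c '(#\; #\Space #\Tab)))
                                  value :start from)
                     (length value))))
        (string-downcase (string-trim "\"" (subseq value from to)))))))

(defun charset-external-format (name)
  "Look up the native token for the MIME charset NAME, or NIL if unknown."
  (cdr (assoc name *charset-keywords* :test #'string-equal)))
```

For instance, (charset-external-format (content-type-charset
"text/plain; charset=utf-8")) returns :UTF-8, which can then be
passed as the :external-format when reopening the file.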



On the other hand, if the dbx file may contain ranges encoded in
different charsets, you'll have to open it as a binary file and
explicitly convert the bytes to characters (if you need that!) using
implementation-specific functions such as #+clisp
ext:convert-string-from-bytes and #+sbcl sb-ext:octets-to-string.
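
As a sketch, decoding one range of an octet vector could look like
this.  DECODE-LATIN-1 is a name made up for the example, Latin-1 is
hard-wired only to keep it short, and the fallback branch assumes
that character codes below 256 coincide with Latin-1:

```lisp
(defun decode-latin-1 (octets &key (start 0) (end (length octets)))
  "Decode OCTETS between START and END as ISO-8859-1 text."
  (let ((octets (coerce octets '(vector (unsigned-byte 8)))))
    #+sbcl
    (sb-ext:octets-to-string octets :external-format :latin-1
                                    :start start :end end)
    #+clisp
    (ext:convert-string-from-bytes octets charset:iso-8859-1
                                   :start start :end end)
    #-(or sbcl clisp)
    (map 'string #'code-char (subseq octets start end))))
```

A real version would take the encoding as a parameter, so that each
range can be decoded with its own charset.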


http://www.cliki.net/CloserLookAtCharacters
-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Debugging?  Klingons do not debug! Our software does not coddle the
weak."