From: Peter Schupp
Subject: Umlaute (ascii > 127?) within external files
Date: 
Message-ID: <39E48166.EAD050D3@object-it.de>
Hi,

unfortunately I've got a real German problem: does anybody know how to
treat Umlaute (äöü ÄÖÜ etc.)?
I would like to read strings from an external file (using "with-open-file"
and "read-line") and move all words (separated by blanks) to a list.
Afterwards I would like to compare the list's entries to components.

When I read a string from the external file and move it character by
character to a list, I can see that the Umlaute (ÄÖÜ) are read as
double-byte chars.


This may be the content of my external file
-----------------------------------------------------
Arme Hände Füße Beine
Ohren Nase Test Kühe
123 abc def ghi


This is my test code
---------------------------

(defun test-read-char-separated-file (v_in-file v_col-sep)
(let (
      v_list
   )

   ; ----- testing and debugging purposes -----
   (log-message (format () "\n--- function test-read-char-separated-file started ---\n"))

   ; open file and read lines
   (with-open-file (v_in-stream v_in-file :direction :input)
      (while (not (eql (setq v_line (read-line v_in-stream)) 'eof))
         (setq v_pos 0)
         (log-message (format () ": ~A" (string v_line)))
         ; loop until the pointer reaches the end of the input string
         (while (< v_pos (string-length v_line))
            (log-message (format () "char ~D : ~A" v_pos (char v_line v_pos)))
            (push (char v_line v_pos) v_list)
            ; move on to next position
            (inc v_pos)
         )
      )
   )

   ; reverse the result list v_list
   (setq v_list (reverse v_list))
   ; ----- test and debug
   (log-message (format () "\n--- function test-read-ret-file finished ---\n"))
   ; return the list
   v_list
))


--
_______________________________________________________________________

 ···················@object-it.de               Privat:

 STZ object-IT        Tel. 0711 18 39 78 6   |  Kirchheimer Str. 18
 Postfach 10 43 62    Fax  0711 18 39 68 7   |  73760 Ostfildern-Ruit
 D-70038 Stuttgart    D2   0172 9 06 71 62   |  Tel 0711 44 16 06 5

 PGP Key available at: http://wwwkeys.de.pgp.net
_______________________________________________________________________

From: Rainer Joswig
Subject: Re: Umlaute (ascii > 127?) within external files
Date: 
Message-ID: <joswig-4544F4.18102311102000@news.is-europe.net>
In article <·················@object-it.de>, ············@object-it.de 
wrote:

> Hi,
> 
> unfortunately I've got a real German problem: does anybody know how to
> treat Umlaute (äöü ÄÖÜ etc.)?

This "problem" exists for a lot of languages.

You didn't tell us which OS and which Lisp you are using.

> I would like to read strings from an external file (using "with-open-file"
> and "read-line") and move all words (separated by blanks) to a list.
> Afterwards I would like to compare the list's entries to components.
> 
> When I read a string from the external file and move it character by
> character to a list, I can see that the Umlaute (ÄÖÜ) are read as
> double-byte chars.

So what is the problem?

> 
> 
> This may be the content of my external file
> -----------------------------------------------------
> Arme Hände Füße Beine
> Ohren Nase Test Kühe
> 123 abc def ghi
> 
> 
> This is my test code
> ---------------------------
> 
> (defun test-read-char-separated-file (v_in-file v_col-sep)
> (let (
>       v_list
>    )
> 
>    ; ----- testing and debugging purposes -----
>    (log-message (format () "\n--- function test-read-char-separated-file started ---\n"))
> 
>    ; open file and read lines
>    (with-open-file (v_in-stream v_in-file :direction :input)
>       (while (not (eql (setq v_line (read-line v_in-stream)) 'eof))
>          (setq v_pos 0)
>          (log-message (format () ": ~A" (string v_line)))
>          ; loop until the pointer reaches the end of the input string
>          (while (< v_pos (string-length v_line))
>             (log-message (format () "char ~D : ~A" v_pos (char v_line v_pos)))
>             (push (char v_line v_pos) v_list)
>             ; move on to next position
>             (inc v_pos)
>          )
>       )
>    )
> 
>    ; reverse the result list v_list
>    (setq v_list (reverse v_list))
>    ; ----- test and debug
>    (log-message (format () "\n--- function test-read-ret-file finished ---\n"))
>    ; return the list
>    v_list
> ))

Try stuff like this (use one of the SPLIT-STRING functions
that have been posted to comp.lang.lisp recently).

(defun read-file-as-delimited-lines (stream column-character)
  (loop for line = (read-line stream nil nil)
        while line
        collect (ccl::split-string line :item column-character)))

(defun test (string)
  (with-input-from-string (stream string)
    (read-file-as-delimited-lines stream #\space)))

(test "Arme Hände Füße Beine
Ohren Nase Test Kühe
123 abc def ghi")

-> (("Arme" "Hände" "Füße" "Beine") ("Ohren" "Nase" "Test" "Kühe") ("123" "abc" "def" "ghi"))
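CCL::SPLIT-STRING is an internal function of one implementation, so it may not
exist elsewhere. As a minimal portable sketch of such a splitter (the name
SPLIT-ON-CHAR is hypothetical, and dropping empty fields is an assumption that
suits blank-separated words, not necessarily what CCL::SPLIT-STRING does),
something like this could be used in its place:

```lisp
(defun split-on-char (line separator)
  "Return the non-empty substrings of LINE delimited by the character SEPARATOR.
Hypothetical helper; runs of separators produce no empty strings."
  (let ((words '())
        (start 0)
        (length (length line)))
    (loop
      ;; Stop once the whole line has been consumed.
      (when (>= start length) (return))
      ;; END is the next separator, or the end of the line if there is none.
      (let ((end (or (position separator line :start start) length)))
        (when (> end start)
          (push (subseq line start end) words))
        (setf start (1+ end))))
    (nreverse words)))
```

With this definition, (split-on-char line column-character) would replace the
CCL::SPLIT-STRING call in READ-FILE-AS-DELIMITED-LINES.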

-- 
Rainer Joswig, Hamburg, Germany
Email: ·············@corporate-world.lisp.de
Web: http://corporate-world.lisp.de/
From: Lieven Marchand
Subject: Re: Umlaute (ascii > 127?) within external files
Date: 
Message-ID: <m3aecbz6kp.fsf@localhost.localdomain>
Peter Schupp <············@object-it.de> writes:

> Hi,
> 
> unfortunately I've got a real German problem: does anybody know how to
> treat Umlaute (äöü ÄÖÜ etc.)?

What implementation are you using? This stuff is still fairly
implementation specific. One possibility is to look in your vendor
documentation for possible values for the :EXTERNAL-FORMAT keyword
argument for OPEN.
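
As a rough sketch of how the :EXTERNAL-FORMAT argument is passed, assuming a
Common Lisp implementation: WITH-OPEN-FILE forwards it to OPEN. The function
name READ-LINES-WITH-FORMAT is hypothetical, and the set of accepted values
other than :DEFAULT is implementation-defined, so check the vendor
documentation before relying on a particular keyword.

```lisp
(defun read-lines-with-format (path external-format)
  "Return the lines of the file at PATH as a list of strings.
EXTERNAL-FORMAT is handed to OPEN unchanged; only :DEFAULT is portable,
any other value is an implementation-specific assumption."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (loop for line = (read-line in nil nil)
          while line
          collect line)))
```

A call such as (read-lines-with-format "words.txt" :default) is portable; a
value like :latin-1 or :utf-8 only works where the implementation defines it.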

-- 
Lieven Marchand <···@bewoner.dma.be>
Lambda calculus - Call us a mad club
From: Pierre R. Mai
Subject: Re: Umlaute (ascii > 127?) within external files
Date: 
Message-ID: <87bswqp7dd.fsf@orion.bln.pmsf.de>
Peter Schupp <············@object-it.de> writes:

> When I read the string from an external file and move it one by one
> character to a list I can
> see, that the Umlaute  (ÄÖÜ) are read as double byte chars.

This would seem to indicate that the external file is encoded in
something like UTF-8 (i.e. Unicode in a varying-byte representation).
This is most likely not what you want.  What you probably really want
is ISO Latin-1 encoding, or the Windows mangling of said encoding.

You should probably first check whether the external file is in UTF-8
encoding.  Furthermore you need to check which implementation of Lisp
you are using, and what kinds of external formats it supports (see
documentation of OPEN).  If you are in luck, then it might support
reading UTF-8 directly.

If you are not in luck, you can either convert the external file to
Latin-1 beforehand or do the conversion in Lisp, e.g. like so:

(defun convert-utf8-to-latin1 (string)
  (declare (string string) (optimize (speed 3)))
  (with-output-to-string (stream)
    (let ((length (length string))
          (index 0))
      (declare (fixnum length index))
      (loop
        (unless (< index length) (return nil))
        (let* ((char (char string index))
               (code (char-code char)))
          (cond
            ((< code #x80) ; ASCII
             (write-char char stream)
             (incf index 1))
            ((< code #xC0) 
             ;; We are in the middle of a multi-byte sequence!
             ;; This should never happen, so we raise an error.
             (error "Encountered illegal multi-byte sequence."))
            ((< code #xC4)
             ;; Two byte sequence in Latin-1 range
             (unless (< (1+ index) length)
               (error "Encountered incomplete two-byte sequence."))
             (let* ((char2 (char string (1+ index)))
                    (code2 (char-code char2)))
               (unless (and (logbitp 7 code2) (not (logbitp 6 code2)))
                 (error "Second byte in sequence is not a continuation."))
               (let* ((upper-bits (ldb (byte 2 0) code))
                      (lower-bits (ldb (byte 6 0) code2))
                      (new-code (dpb upper-bits (byte 2 6) lower-bits)))
                 (write-char (code-char new-code) stream)))
             (incf index 2))
            ((>= code #xFE)
             ;; Ignore stray byte-order markers
             (incf index 1))
            (t
             (error "Multi-byte sequence outside Latin-1 range."))))))))

Note that this is not in any way the most efficient way to do the
conversion (a table driven approach would probably work best).  It
also relies on the charset of the implementation being Latin-1 and
char-code/code-char being implemented accordingly.

But it should suffice to test if you are indeed dealing with UTF-8 or
not...
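
To make the bit manipulation in the two-byte branch concrete, here is a small
worked example using the same LDB/DPB steps. The helper name DECODE-TWO-BYTE
is hypothetical and only isolates the arithmetic; it does none of the error
checking of the function above.

```lisp
(defun decode-two-byte (code code2)
  "Combine a UTF-8 lead byte CODE (#xC2 or #xC3) and a continuation byte CODE2
into the corresponding Latin-1 character code. Hypothetical helper."
  (let ((upper-bits (ldb (byte 2 0) code))    ; low 2 bits of the lead byte
        (lower-bits (ldb (byte 6 0) code2)))  ; low 6 bits of the continuation
    ;; Place the 2 upper bits above the 6 lower bits to form an 8-bit code.
    (dpb upper-bits (byte 2 6) lower-bits)))

;; a-umlaut is stored in UTF-8 as the bytes C3 A4:
;; upper-bits = 3, lower-bits = #x24, result = #xC0 + #x24 = #xE4.
(format t "~X~%" (decode-two-byte #xC3 #xA4)) ; prints E4

;; sharp s is stored as C3 9F:
;; upper-bits = 3, lower-bits = #x1F, result = #xC0 + #x1F = #xDF.
(format t "~X~%" (decode-two-byte #xC3 #x9F)) ; prints DF
```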

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Peter Schupp
Subject: Re: Umlaute (ascii > 127?) within external files
Date: 
Message-ID: <39E55F46.DC323597@object-it.de>
Thanks for your support so far and for the follow-up questions.

This is my environment:

I'm using Interleaf Lisp within Quicksilver 7, a DTP system that certainly
uses a very special implementation of Lisp. I run it on Intel machines
under Windows NT 4.x and Windows 98.

Peter