Extended-char -> Standard-char

From: Don Geddis
Subject: Extended-char -> Standard-char
Date: Mon, 09 Sep 2002 02:10:22 +0000
Message-ID: <m3n0qrkj5d.fsf@maul.geddis.org>

If I have a string that includes some European accented characters, is there
a (portable, simple) way to convert each such character to its corresponding
unaccented version?

I was hoping for something along the lines of STRING-DOWNCASE, which does
the mapping from uppercase to corresponding lowercase characters.  It would
be great if something like this worked:
        (string-simplify "ãéí") -> "aei"

Any suggestions?  The best I've come up with is building my own manual map
character by character.  Or perhaps looking at the character codes, and doing
arithmetic (assuming they're all contiguous).  I was hoping this kind of
transformation might be built in to Common Lisp, but was unable to find the
appropriate function.

        -- Don
_______________________________________________________________________________
Don Geddis                    http://don.geddis.org              ···@geddis.org

Re: Extended-char -> Standard-char Joe Marshall
Re: Extended-char -> Standard-char Adam Warner
Re: Extended-char -> Standard-char Adam Warner
Re: Extended-char -> Standard-char Arthur Lemmens
- Re: Extended-char -> Standard-char Gisle Sælensminde
  - Re: Extended-char -> Standard-char Don Geddis
    - Re: Extended-char -> Standard-char Nils Goesche
    - Re: Extended-char -> Standard-char Aleksandr Skobelev
  - Re: Extended-char -> Standard-char Hannah Schroeter
    - Re: Extended-char -> Standard-char Tim Bradshaw

From: Joe Marshall
Subject: Re: Extended-char -> Standard-char
Date: Mon, 09 Sep 2002 14:14:35 +0000
Message-ID: <y9abuua9.fsf@ccs.neu.edu>

Don Geddis <···@geddis.org> writes:

> If I have a string that includes some European accented characters, is there
> a (portable, simple) way to convert each such character to its corresponding
> unaccented version?
> 
> I was hoping for something along the lines of STRING-DOWNCASE, which does
> the mapping from uppercase to corresponding lowercase characters.  It would
> be great if something like this worked:
>         (string-simplify "ãéí") -> "aei"
> 
> Any suggestions?  

Look at http://www.unicode.org/unicode/reports/tr15/
under `Compatibility Equivalence'
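
The mechanism that report describes can be sketched in a few lines: a
decomposition splits a precomposed letter into its base letter plus a
combining mark, and dropping the marks leaves the unaccented text. What
follows is only an illustration under stated assumptions. It works on
integer code points so that it does not depend on the implementation's
character set, and the table *DECOMPOSITIONS* and all function names are
made up for this example and cover just three letters. A real version
would take its mappings from the Unicode Character Database.

```lisp
;; Hypothetical, hand-copied excerpt of canonical decompositions:
;; precomposed code point -> (base-letter combining-mark).
(defparameter *decompositions*
  '((#xE3 . (#x61 #x303))    ; a with tilde -> a + combining tilde
    (#xE9 . (#x65 #x301))    ; e with acute -> e + combining acute
    (#xED . (#x69 #x301))))  ; i with acute -> i + combining acute

(defun combining-mark-p (code)
  "True if CODE lies in the Combining Diacritical Marks block."
  (<= #x300 code #x36F))

(defun decompose (codes)
  "Replace every code point that has a table entry by its decomposition."
  (loop for code in codes
        append (or (cdr (assoc code *decompositions*))
                   (list code))))

(defun strip-marks (codes)
  "Decompose CODES, then drop the combining marks."
  (remove-if #'combining-mark-p (decompose codes)))
```

With these definitions, (strip-marks '(#xE3 #xE9 #xED)) yields the list
(97 101 105), and (map 'string #'code-char '(97 101 105)) turns that
into "aei".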

From: Adam Warner
Subject: Re: Extended-char -> Standard-char
Date: Mon, 09 Sep 2002 03:00:09 +0000
Message-ID: <alh2io$1q9uns$1@ID-105510.news.dfncis.de>

Hi Don Geddis,

> If I have a string that includes some European accented characters, is there
> a (portable, simple) way to convert each such character to its corresponding
> unaccented version?
> 
> I was hoping for something along the lines of STRING-DOWNCASE, which does
> the mapping from uppercase to corresponding lowercase characters.  It would
> be great if something like this worked:
>         (string-simplify "ãéí") -> "aei"

I was surprised the other day to stumble across this: CLISP knows the full name
of extended characters! Examples:

[1]> #\ã
#\LATIN_SMALL_LETTER_A_WITH_TILDE
[2]> #\é
#\LATIN_SMALL_LETTER_E_WITH_ACUTE
[3]> #\í
#\LATIN_SMALL_LETTER_I_WITH_ACUTE

So long as this information is available in your implementation you should
be able to extract the letter from the long name.
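
A rough sketch of that extraction follows. It assumes names shaped like
the CLISP output above (underscore-separated, "LATIN_..._LETTER_X_WITH_...");
other implementations may spell character names differently or return
NIL from CHAR-NAME. The function names BASE-LETTER-FROM-NAME and
BASE-LETTER are made up for this example.

```lisp
(defun base-letter-from-name (name)
  "Return the plain letter encoded in a name such as
\"LATIN_SMALL_LETTER_A_WITH_TILDE\", or NIL if NAME lacks that shape."
  (let* ((name (string-upcase name))
         (marker "_LETTER_")
         (start (search marker name)))
    (when (and start (eql 0 (search "LATIN_" name)))
      (let ((tail (subseq name (+ start (length marker)))))
        ;; Accept a lone letter, or a letter followed by "_WITH_...".
        ;; Names like LATIN_SMALL_LETTER_SHARP_S are rejected.
        (when (and (plusp (length tail))
                   (or (= (length tail) 1)
                       (eql 1 (search "_WITH_" tail))))
          (let ((letter (char tail 0)))
            (if (search "SMALL" name :end2 start)
                (char-downcase letter)
                letter)))))))

(defun base-letter (char)
  "Unaccented letter for CHAR if its name reveals one, else CHAR itself."
  (let ((name (char-name char)))
    (or (and name (base-letter-from-name name))
        char)))
```

Ligatures and letters without a "_WITH_" part (sharp s, ae, eth, thorn)
fall through unchanged, so they would still need a hand-written table.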

Personally I'm standardising upon Unicode+UTF-8 encoding which
accommodates an astonishing array of characters and symbols while
being ASCII backwards compatible. UTF-8 is a wonderful encoding design.

Regards,
Adam

From: Arthur Lemmens
Subject: Re: Extended-char -> Standard-char
Date: Mon, 09 Sep 2002 19:56:40 +0000
Message-ID: <3D7CFCF7.DD4507E5@xs4all.nl>

Don Geddis wrote:
 
> If I have a string that includes some European accented characters, is there
> a (portable, simple) way to convert each such character to its corresponding
> unaccented version?

Below is a quick hack that I've used a few times to interface with systems that
handle only ASCII characters. (This won't make much sense if you're not reading
this with an ISO-8859-1 character set.) As for portability: you can't really 
expect that when you're dealing with non-standard characters.


 Arthur Lemmens

 ;;;;;;;;;;;;;;;;

(defun asciify (string &key (default :skip))
  "Returns a string containing only ASCII characters. Non-ASCII characters in
the input string will be replaced by something resembling the original, if
possible. Otherwise, they will be replaced by DEFAULT (or skipped, when DEFAULT
is :SKIP)."
  (let ((specials '(("àáâãäåæ" #\a)
                    ("ÀÁÂÃÄÅÆ" #\A)
                    ("èéêë" #\e)
                    ("ÈÉÊË" #\E)
                    ("ìíîï" #\i)
                    ("ÌÍÎÏ" #\I)
                    ("òóôõö" #\o)
                    ("ÒÓÔÕÖ" #\O)
                    ("ùúûü" #\u)
                    ("ÙÚÛÜ" #\U)
                    ("ý" #\y)
                    ("Ý" #\Y)
                    ("ç" #\c)
                    ("Ç" #\C)
                    ("ñ" #\n)
                    ("Ñ" #\N)
                    ("ð" #\d)
                    ("Ð" #\D))))
    (with-output-to-string (result)
      (loop for char across string
            do (let ((code (char-code char)))
                 (if (<= code 127)
                     (write-char char result)
                   (let ((rule (find-if (lambda (rule)
                                          (find char (first rule)))
                                        specials)))
                     (if rule
                         (write-char (second rule) result)
                       (unless (eq default :skip)
                         (write-char default result))))))))))


Here are a few examples:

CL-USER 6 >  (asciify "José árbol niño")
"Jose arbol nino"

CL-USER 7 > (asciify "¡no!" :default :skip)
"no!"

CL-USER 8 > (asciify "¡no!" :default #\!)
"!no!"
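
The caveat about needing an ISO-8859-1 reader applies to the source file
as well as to the posting. One way around it, sketched here, is to key
the table on character codes so the source stays pure ASCII. This
assumes CHAR-CODE agrees with ISO-8859-1 for codes below 256, which the
standard does not guarantee but mainstream implementations provide. The
names *LATIN1-REWRITES* and ASCIIFY-BY-CODE are made up for this
example, and the table is deliberately incomplete.

```lisp
;; Hypothetical table: lists of ISO-8859-1 codes -> ASCII replacement.
(defparameter *latin1-rewrites*
  '(((224 225 226 227 228 229) . #\a)  ; a with grave .. a with ring
    ((232 233 234 235)         . #\e)  ; e with grave .. e with diaeresis
    ((236 237 238 239)         . #\i)
    ((242 243 244 245 246)     . #\o)
    ((249 250 251 252)         . #\u)
    ((241)                     . #\n)  ; n with tilde
    ((231)                     . #\c))) ; c with cedilla

(defun asciify-by-code (string &optional (default #\?))
  "Copy STRING, replacing known accented characters by plain letters
and any other non-ASCII character by DEFAULT."
  (map 'string
       (lambda (char)
         (let ((code (char-code char)))
           (if (< code 128)
               char
               (let ((rule (assoc code *latin1-rewrites* :test #'member)))
                 (if rule (cdr rule) default)))))
       string))
```

For instance, applying ASCIIFY-BY-CODE to the four characters J, o, s
and (code-char 233) returns "Jose".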

From: Gisle Sælensminde
Subject: Re: Extended-char -> Standard-char
Date: Tue, 10 Sep 2002 15:46:46 +0000
Message-ID: <slrnans4v6.1vf.gisle@apal.ii.uib.no>

In article <·················@xs4all.nl>, Arthur Lemmens wrote:
> 
> Don Geddis wrote:
>  
>> If I have a string that includes some European accented characters, is there
>> a (portable, simple) way to convert each such character to its corresponding
>> unaccented version?
> 
> Below is a quick hack that I've used a few times to interface with systems that
> handle only ASCII characters. (This won't make much sense if you're not reading
> this with an ISO-8859-1 character set.) As for portability: you can't really 
> expect that when you're dealing with non-standard characters.

The idea of simply converting "accented characters" to "unaccented"
characters is a misconception. In some cases you can do this, but some
of these characters are considered letters in their own right. This
includes the Norwegian/Danish letters æ, ø and å, and the Icelandic
letter 'ð' (eth). In the latter case your translation to a 'd' is
mistaken, since the letter is pronounced more like 'th' if I remember
right. Also, the accents are necessary to preserve the meaning of the
words in some cases.

The 'correct' way of doing this is often to replace the letter
with groups of others, eg:

å => aa
ø => oe
æ => ae
ß => ss

So your quick hack is not the right way.

--
Gisle Sælensminde ( ·····@ii.uib.no )

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going
to land, and it could be dangerous sitting under them as they fly
overhead. (from RFC 1925)

From: Don Geddis
Subject: Re: Extended-char -> Standard-char
Date: Tue, 10 Sep 2002 18:18:18 +0000
Message-ID: <m365xd909h.fsf@maul.geddis.org>

Thanks much to everyone for the code suggestions on converting accented
characters to unaccented ones.  I've assembled the code and ideas from
Arthur Lemmens, Paul Foley, and Gisle Sælensminde into the final (?) code at
        http://don.geddis.org/lisp/asciify.lisp
and also appended below.

Much appreciated.  This is exactly the solution I needed.
_______________________________________________________________________________
Don Geddis                    http://don.geddis.org              ···@geddis.org
To me, boxing is like ballet, except there's no music, no choreography, and the
dancers hit each other.
	-- Deep Thoughts, by Jack Handey [1999]

;;;----------------------------------------------------------------------------
;;;
;;; ASCIIFY
;;;
;;; Convert a string with European accented characters, into 7-bit ASCII.
;;;
;;; Original code by Arthur Lemmens <········@xs4all.nl>.
;;; Improved (multi-char rewriting) by Paul Foley <·······@actrix.gen.nz>.
;;; Additional suggestions by Gisle Sælensminde <·····@apal.ii.uib.no>.
;;; Final assembly by Don Geddis <···@geddis.org>.
;;;
;;; Available at http://don.geddis.org/lisp/asciify.lisp
;;;
;;;----------------------------------------------------------------------------
;;;
;;; Examples:
;;; USER> (asciify "José árbol niño")
;;; "Jose arbol nino"
;;; USER> (asciify "¡no!" :default :skip)
;;; "no!"
;;; USER> (asciify "¡no!" :default #\!)
;;; "!no!"
;;;
;;; [Note: the last example no longer works, because "¡" is now part of the
;;;  built-in map.  But it should work for some other unknown character.]
;;;
;;;----------------------------------------------------------------------------

(defparameter *accent-rewrites*
  '(
    ("àáâãä" . #\a)
    ("ÀÁÂÃÄ" . #\A)
    ("èéêë"  . #\e)
    ("ÈÉÊË"  . #\E)
    ("ìíîï"  . #\i)
    ("ÌÍÎÏ"  . #\I)
    ("òóôõö" . #\o)
    ("ÒÓÔÕÖ" . #\O)
    ("ùúûü"  . #\u)
    ("ÙÚÛÜ"  . #\U)
    ("ý"     . #\y)
    ("Ý"     . #\Y)
    ("ç"     . #\c)
    ("Ç"     . #\C)
    ("ñ"     . #\n)
    ("Ñ"     . #\N)
    ("ðþ"    . "th")
    ("ÐÞ"    . "Th")
    ("ÿ"     . "ij")
    ("å"     . "aa")
    ("Å"     . "Aa")
    ("æ"     . "ae")
    ("Æ"     . "Ae")
    ("ß"     . "ss")
    ("ø"     . "oe")
    ("¿¡"    . :skip)
    ))

;;;----------------------------------------------------------------------------

(defun asciify (string &key (default :skip))
  "Returns a string containing only ASCII characters.  Non-ASCII characters in
the input string will be replaced by something resembling the original, if
possible.  Otherwise, they will be replaced by DEFAULT (or skipped, when
DEFAULT is :SKIP)."

  (with-output-to-string (result)
    (loop for char across string
          if (char<= char #\Delete)
            do (write-char char result)
          else
            do (let ((replacement (or (cdr (assoc char *accent-rewrites*
                                                  :test #'position))
                                      default
                                      char))) ; keep CHAR if DEFAULT is NIL
                 (unless (eq replacement :skip)
                   (princ replacement result))))))

;;;----------------------------------------------------------------------------

From: Nils Goesche
Subject: Re: Extended-char -> Standard-char
Date: Tue, 10 Sep 2002 18:24:20 +0000
Message-ID: <lkofb5bt4b.fsf@pc022.bln.elmeg.de>

Don Geddis <···@geddis.org> writes:

> (defparameter *accent-rewrites*
>   '(
>     ("àáâãä" . #\a)

[snip]

If you also want to be kind to the Germans, change that to

(defparameter *accent-rewrites*
  '(
    ("àáâã"  . #\a)
    ("ÀÁÂÃ"  . #\A)
    ("èéêë"  . #\e)
    ("ÈÉÊË"  . #\E)
    ("ìíîï"  . #\i)
    ("ÌÍÎÏ"  . #\I)
    ("òóôõ"  . #\o)
    ("ÒÓÔÕ"  . #\O)
    ("ùúû"   . #\u)
    ("ÙÚÛ"   . #\U)
    ("ý"     . #\y)
    ("Ý"     . #\Y)
    ("ç"     . #\c)
    ("Ç"     . #\C)
    ("ñ"     . #\n)
    ("Ñ"     . #\N)
    ("ðþ"    . "th")
    ("ÐÞ"    . "Th")
    ("ÿ"     . "ij")
    ("å"     . "aa")
    ("ü"     . "ue")
    ("Ü"     . "Ue")
    ("Å"     . "Aa")
    ("Ö"     . "Oe")
    ("æä"    . "ae")
    ("ÆÄ"    . "Ae")
    ("ß"     . "ss")
    ("øö"    . "oe")
    ("¿¡"    . :skip)
    ))

Regards,
-- 
Nils Goesche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x0655CFA0

From: Aleksandr Skobelev
Subject: Re: Extended-char -> Standard-char
Date: Wed, 11 Sep 2002 07:19:13 +0000
Message-ID: <m3y9a9t2mm.fsf@list.ru>

Don Geddis <···@geddis.org> writes:

> _______________________________________________________________________
> Don Geddis            http://don.geddis.org              ···@geddis.org
> To me, boxing is like ballet, except there's no music, no choreography,
> and the dancers hit each other.
> 	-- Deep Thoughts, by Jack Handey [1999]

Sorry for the OT, but I can't help mentioning that I was recently very
impressed by the fight between Johnny Tapia and ??? Rodriges, a recording
of which was shown on TV. It really was a ballet, and just a great
performance!

From: Hannah Schroeter
Subject: Re: Extended-char -> Standard-char
Date: Tue, 10 Sep 2002 17:49:51 +0000
Message-ID: <allbbv$vqf$1@c3po.schlund.de>

Hello!

Gisle Sælensminde  <·····@apal.ii.uib.no> wrote:
>[...]

>The idea of simply converting "accented characters" to "unaccented"
>characters is a misconception. In some cases you can do this, but some
>of these characters are considered letters in their own right. This
>includes the Norwegian/Danish letters æ, ø and å, and the Icelandic
>letter 'ð' (eth). In the latter case your translation to a 'd' is
>mistaken, since the letter is pronounced more like 'th' if I remember
>right. Also, the accents are necessary to preserve the meaning of the
>words in some cases.

In fact, there's þ (like an un-voiced English th, i.e. as in "bath"),
ð (like a voiced English th, as in "this"). In addition, Icelandic
has æ and ö as letters with their own positions in the alphabet,
as well as á, é, í, ó, ú, ý where the accent may carry a semantic
difference. And all of them in upcase, too.

However, natural language redundancy often allows for reading
mutilated text, such as Icelandic with á->a, ..., ö -> o, þ/ð -> th
or Turkish with ç -> c, s-with-tail -> s, i-without-dot -> i,
and so on. But it isn't really nice compared to less lossy
transliterations.

>The 'correct' way of doing this is often to replace the letter
>with groups of others, eg:

>å => aa
>ø => oe
>æ => ae
>ß => ss

Yeah right. Or á -> 'a, even if it looks strange at first
glance.

>[...]

Kind regards,

Hannah.

From: Tim Bradshaw
Subject: Re: Extended-char -> Standard-char
Date: Wed, 11 Sep 2002 06:34:29 +0000
Message-ID: <ey3vg5dyqyy.fsf@cley.com>

* Hannah Schroeter wrote:
> In fact, there's þ (like an un-voiced English th, i.e. as in "bath"),
> ð (like a voiced English th, as in "this"). In addition, Icelandic
> has æ and ö as letters with their own positions in the alphabet,
> as well as á, é, í, ó, ú, ý where the accent may carry a semantic
> difference. And all of them in upcase, too.

This is terribly off-topic by now, but I once read that the use of
`ye' as in `ye olde worlde' is partly due to misreading þ - it can
actually be `þe', or in fact what is now `the' or `thee' (the þ is
voiced here, but I think it's essentially the same character).  Things
aren't simple because there is probably `ye' as well I think - for the
2nd person pronoun it looks like there should have been:

        nom?    acc?
sing    thou    thee
plur    you     ye

but for the definite article I think þ is quite plausible.

--tim