Japanese Character Lookup / Unicode / GUI (longish)

From: Tim Josling
Subject: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 04:22:01 +0000
Message-ID: <3D7195E9.EFC718FE@melbpc.org.au>

I am learning Japanese and I want to make a program to help look up the Kanji
characters and words. 

Background
----------

When learning French, I found that the 'inner loop' of learning a language is
the time taken to look up unknown words in the dictionary. Generally I found I
had to be exposed to a word 6 or more times before I remembered it. Some words
took even longer, for example those that have multiple meanings or meanings
not well aligned with English. If it takes 45 seconds to look up a word, and
you look up each word 6 times and you need to learn about 20,000 words, that's
about 60 solid 24-hour days of looking up words in a dictionary. For people
with a perfect memory, that would reduce to "only" 10 days. 
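
A quick check of the arithmetic above, as a minimal Python sketch. The
function name `lookup_days` is made up for illustration; the inputs are the
figures quoted in the paragraph (45 seconds, 6 exposures, 20,000 words).

```python
SECONDS_PER_DAY = 60 * 60 * 24


def lookup_days(seconds_per_lookup, exposures_per_word, vocabulary_size):
    """Total dictionary time, expressed in solid 24-hour days."""
    total_seconds = seconds_per_lookup * exposures_per_word * vocabulary_size
    return total_seconds / SECONDS_PER_DAY


# Six exposures per word: 5,400,000 seconds in total.
typical_days = lookup_days(45, 6, 20_000)

# "Perfect memory": each word is looked up exactly once.
perfect_memory_days = lookup_days(45, 1, 20_000)

print(f"typical: {typical_days:.1f} days")
print(f"perfect memory: {perfect_memory_days:.1f} days")
```

This gives roughly 62 days and roughly 10 days, in line with the rounded
figures in the text.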

These numbers are not made up. As a rule of thumb, it takes about 1,000 hours
to learn a language related to one you know already, and about twice that for
an unrelated language. Most of that time is spent on vocabulary, if you are
using your time well. Grammar is sometimes complex but not usually very large.
When learning German, I actually wore out a dictionary and had to buy a new
copy.

So anyway when learning French I set up a text file and whenever I found a
word I did not know I looked it up in the text file. If it was not in the text
file, I looked it up in the dictionary and put it in the text file. In this
way I drastically sped up my learning rate. Naturally I used emacs, a
lisp-based text editor.

This approach suffices for European languages which are more or less phonetic
and have a concept of alphabetical order. 

Those learning these languages have a large advantage over those learning
languages based on Chinese characters for reasons which I explain below.

Japanese writing is based on Chinese characters, called Kanji in Japanese.
There are also two other character sets, Hiragana and Katakana, which are
based on simplified versions of a small number of Chinese characters.

Hiragana and Katakana are phonetic character sets, and are quite small in
number (a few dozen characters of each). Hiragana are used for grammatical
particles such as 'wa' signifying that the preceding word is the main topic of
the sentence, and also for verb and adjective endings. Katakana are used for
spelling out foreign words and also for functions similar to italics in
English.

Kanji are used in the main words in a sentence and for verb roots. They
express the main ideas of the sentence, with the hiragana functioning as
grammatical decoration and embellishment.

In Japanese there are about 2,000 main Kanji, and maybe another 1,000 in
reasonably common use. In Chinese there are tens or hundreds of thousands. 

In Japanese each Kanji character has two main meanings. One is based on the
meaning of the character in Chinese. The second is based on the sound of the
character in Chinese, and what meaning that sound has in Japanese. These two
meanings have different pronunciations and are referred to as the 'readings'
of the character. The pronunciations of the character cannot be determined
from the written character form, except by memorising the pronunciations of
all the characters.

Words are made up of one or more characters. Some words consist of up to 6
characters. There is no space between words, although sometimes the presence
of hiragana can indicate the end of a word.

The process of looking up a word in Japanese is as follows:

1. Identify each character
2. Identify the end of the word if possible. If not possible, you will need to
do a backtracking search of all possible word lengths until you find something
that makes sense. 
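
The backtracking search in step 2 could look something like the following
Python sketch. Everything here is an assumption made for illustration: the
function name `segmentations`, the tiny hand-made lexicon, and the cap of 6
characters per word (taken from the remark above about word length). A real
program would load its lexicon from a dictionary file.

```python
def segmentations(text, lexicon, max_len=6):
    """Return every way to split `text` into words found in `lexicon`.

    Tries the longest candidate word first and backs off to shorter ones,
    recursing on the remainder of the text.
    """
    if not text:
        return [[]]  # one way to segment nothing: the empty split
    results = []
    for length in range(min(max_len, len(text)), 0, -1):
        head = text[:length]
        if head in lexicon:
            for rest in segmentations(text[length:], lexicon, max_len):
                results.append([head] + rest)
    return results


# Illustrative lexicon only.
LEXICON = {"日本", "日本語", "本", "語", "の"}

candidates = segmentations("日本語の本", LEXICON)
for words in candidates:
    print(" / ".join(words))
```

More than one split can come back, so a human (or some scoring rule) still
has to pick the one that makes sense.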

The main problem is identifying the character. 

There is no concept of alphabetical order with Kanji, so dictionaries are
organised in various ways. There are usually several search paths:

1. By number of pen strokes. As there are many characters with a given number
of pen strokes, this index is usually used as a last desperate measure. There
is also some fuzziness about the number of pen strokes, especially given that
there have been various attempts to simplify the characters over the years,
which has had the effect of changing the number of pen strokes.

2. By sound. There is a more or less agreed phonetic order, similar to
alphabetical order in English. As mentioned, generally by looking you cannot
tell the sound of a character however so this approach is generally useless
for looking up a character you have read.

3. By 'radical'. Each character is deemed to have a subcomponent which is its
essential element, or radical. There are several dozen radicals. There is a
more or less agreed set of radicals, and a more or less agreed order of
radicals. There are various schemes for determining what is the radical for a
given character. Within a radical, characters are usually ordered in the
dictionary by number of pen strokes.
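
The radical search path in item 3 amounts to a two-level index: group by
radical, then order by stroke count within each group. A minimal Python
sketch, assuming a hand-typed sample of characters (the name
`build_radical_index` and the tuple layout are made up for illustration; a
real program would read this data from a character database):

```python
from collections import defaultdict

# (character, radical, total stroke count): a small illustrative sample.
SAMPLE = [
    ("森", "木", 12),
    ("日", "日", 4),
    ("林", "木", 8),
    ("明", "日", 8),
    ("木", "木", 4),
    ("龍", "龍", 16),
    ("時", "日", 10),
]


def build_radical_index(entries):
    """Map each radical to its characters, sorted by stroke count."""
    index = defaultdict(list)
    for char, radical, strokes in entries:
        index[radical].append((strokes, char))
    for bucket in index.values():
        bucket.sort()
    return index


radical_index = build_radical_index(SAMPLE)
tree_chars = [char for _, char in radical_index["木"]]
print(tree_chars)
```

With an index like this, finding a character means picking its radical and
then scanning a short list ordered by stroke count.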

Looking up a word in a character dictionary can be a major exercise. As an
example, a Chinese friend offered to show me how it is done. We took the
example of 'dragon'. It took us several hours to find this in the dictionary.
The reason was that most characters have a radical which is quite small and
simple. However 'dragon' has a very large radical consisting of the whole
character 'dragon'. It is the only character that has dragon as its radical.

So it can easily take 5-10 minutes on average to identify a character. This
makes learning the language drastically slower than it could be.

What Emacs can do already
-------------------------

1. Display hiragana, katakana, and kanji (as well as ascii).

2. Enter these characters, using the 'japanese*' input methods. These input
methods are based on the phonetics of the characters. Where multiple
characters have the same sound, entering the phonetics (in roman characters)
presents you with a choice, and you select the one you want.

What I want to do
-----------------

I would like to create a small program that allows you to find a character
based on its appearance. The requirements are


1. Display a list of characters showing:

a) Character appearance

b) Phonetic value

c) Character number in the standard character set defined by the government.


2. Allow filtering and sorting by:

a) Radical

b) Number of pen strokes

c) Maybe by phonetic values


3. Allow a way to select the character once you have identified it and have
all its details displayed, such as:

a) The story behind the character (how the character came to be e.g. the
character for "man" is a person with back bent pulling a plough, the character
for "woman" is a person with legs crossed and hand held out for money*).

b) The readings (pronunciations)

c) The meanings of the character.

--> also possibly add the character to a list of accumulated characters, for
use in looking up words. (see 5 below).


4. Some way to enter new character information. From my point of view I would
be happy to hand edit a lisp structure, but I would prefer a simple form for
input.

5. Ability to add word lookup and entry later on. This would be a simple
lookup of 'list of characters' -> word name and word meaning, with probably a
hash table for the first character of the word.
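
The word lookup in item 5 could be sketched as below, in Python rather than
lisp. The names `build_word_table` and `words_starting_at` and the three
sample entries are assumptions made for illustration; the only idea taken
from the text is the hash table keyed on the first character of the word.

```python
from collections import defaultdict

# (word, reading, meaning): a small illustrative sample.
WORDS = [
    ("日本", "nihon", "Japan"),
    ("日本語", "nihongo", "Japanese language"),
    ("本", "hon", "book"),
]


def build_word_table(words):
    """Hash table from first character to all words starting with it."""
    table = defaultdict(list)
    for word, reading, meaning in words:
        table[word[0]].append((word, reading, meaning))
    return table


def words_starting_at(text, pos, table):
    """Entries whose characters match `text` at `pos`, longest first."""
    candidates = table.get(text[pos], [])
    matches = [entry for entry in candidates if text.startswith(entry[0], pos)]
    return sorted(matches, key=lambda entry: len(entry[0]), reverse=True)


word_table = build_word_table(WORDS)
hits = words_starting_at("日本語の本", 0, word_table)
for word, reading, meaning in hits:
    print(word, reading, meaning)
```

Keying on the first character keeps each bucket small, so checking the few
candidate words at a given position is cheap.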

...

From this I see a need for 

a) Display of unicode, or at least ascii, hiragana, katakana and kanji.

b) Standard lisp facilities like hash tables and lists, filtering, sorting,
easy load and store of data structures and general 'write less code to do
more' functionality that you get in lisp.

c) Basic GUI features like radio buttons, check lists etc to control the
display of characters,

d) Character data entry facilities, perhaps via a form.

At the moment I am leaning to emacs lisp.

It supports the character sets needed, it has most lisp features, it has the
gui widgets I need, and it supports entry of the characters in a pretty
convenient form (short of having a 3,000 character keyboard). It also supports
a simple form facility for entering new data.

My questions are:

1. Do any other (free) lisps support unicode? I am told CLISP does but not
CMUCL.
2. Do any other (free) lisps support building a gui easily?
3. Do any other lisps have the equivalent of emacs's form facility, especially
for input of non ascii data?

Ideally also I would like to hear if anyone else has done a similar thing
already, and if so where I could download it. 

Any other ideas or suggestions would be most welcome. I plan to start writing
this in a couple of weeks.

Tim Josling

* I was told this by a person who is paying alimony to three ex wives on three
continents. It may be wrong, though it does look like that.

From: Gordon Joly
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 11:23:44 +0000
Message-ID: <3d71eab0@212.67.96.135>

In article <·················@melbpc.org.au>,
Tim Josling  <···@melbpc.org.au> wrote:
>I am learning Japanese and I want to make a program to help look up the Kanji
>characters and words. 
>
>Background
>----------
>
>When learning French, I found that the 'inner loop' of learning a language is
>the time taken to look up unknown words in the dictionary. Generally I found I
>had to be exposed to a word 6 or more times before I remembered it. Some words
>took even longer, for example those that have multiple meanings or meanings
>not well aligned with English. If it takes 45 seconds to look up a word, and
>you look up each word 6 times and you need to learn about 20,000 words, that's
>about 60 solid 24-hour days of looking up words in a dictionary. For people
>with a perfect memory, that would reduce to "only" 10 days. 
>[...]



Reminds me of the Chinese Room (John Searle) thought experiment. 

http://www.helsinki.fi/hum/kognitiotiede/searle.html

Gordo

From: Paolo Amoroso
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 14:34:39 +0000
Message-ID: <tNxxPTSnK2lU4Ywxx7hLIhVaBKsn@4ax.com>

On Sun, 01 Sep 2002 14:22:01 +1000, Tim Josling <···@melbpc.org.au> wrote:

> I am learning Japanese and I want to make a program to help look up the Kanji
> characters and words. 
[...]
> Japanese writing is based on Chinese characters, called Kanji in Japanese.
[...]
> Any other ideas or suggestions would be most welcome. I plan to start writing

Several years ago Dario Giuse, then a member of the Garnet team, wrote a
Lisp system called Chinese Tutor and described it in a paper.
Unfortunately, I don't have the reference handy, but you should be able to
find it at your local computer science department.

Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://www.paoloamoroso.it/ency/README

From: Paolo Amoroso
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Mon, 02 Sep 2002 15:07:50 +0000
Message-ID: <TVxzPU1FodqK16H0PiMkWoVSHX2Y@4ax.com>

On Sun, 01 Sep 2002 16:34:39 +0200, Paolo Amoroso <·······@mclink.it>
wrote:

> Several years ago Dario Giuse, then a member of the Garnet team, wrote a
> Lisp system called Chinese Tutor and described it in a paper.
> Unfortunately, I don't have the reference handy, but you should be able to

Found it:

  "Lisp as a Rapid Prototyping Environment: The Chinese Tutor"
  Dario Giuse
  in Lisp and Symbolic Computation - An International Journal,
  Kluwer Academic Publishers, May 1987


Paolo
-- 
EncyCMUCLopedia * Extensive collection of CMU Common Lisp documentation
http://www.paoloamoroso.it/ency/README

From: Bill Clementson
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 16:39:31 +0000
Message-ID: <wkheh94q6w.fsf@attbi.com>

Tim Josling <···@melbpc.org.au> writes:

[snipped lots of interesting stuff]

> My questions are:
> 
> 1. Do any other (free) lisps support unicode? I am told CLISP does but not
> CMUCL.

CLISP and the free versions of ACL & LW all have Unicode support. ACL &
LW have heap limitations - this may or may not be an issue for you.

> 2. Do any other (free) lisps support building a gui easily?

ACL has a form builder. The LW GUI builder is not available in the free
version.

> Any other ideas or suggestions would be most welcome. I plan to start writing
> this in a couple of weeks.

Sounds interesting - good luck & keep us posted on your progress.

--
Bill Clementson

From: Bill Clementson
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 18:12:18 +0000
Message-ID: <wkbs7hefvd.fsf@attbi.com>

Bill Clementson <·······@attbi.com> writes:

> ACL has a form builder. The LW GUI builder is not available in the free
> version.

Oops, just noticed that you posted from a Linux box. I believe that the ACL
form builder is only available on the Windows version of ACL. Sorry for
the misinformation.

--
Bill Clementson

From: Takehiko Abe
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Sun, 01 Sep 2002 12:52:50 +0000
Message-ID: <keke-0109022152500001@solg4.keke.org>

In article <·················@melbpc.org.au>, Tim Josling wrote:

> The main problem is identifying the character. 
> 
> There is no concept of alphabetical order with Kanji, so dictionaries are
> organised in various ways. There are usually several search paths:
> 
> 1. By number of pen strokes. As there are many characters with a given number
> of pen strokes, this index is usually used as a last desperate measure.
> [...]
> 
> 2. By sound. There is a more or less agreed phonetic order, similar to
> alphabetical order in English. As mentioned, generally by looking you cannot
> tell the sound of a character however so this approach is generally useless
> for looking up a character you have read.

This is not strictly correct. Often you can correctly guess the sound of the
character from its components, and sometimes even its rough meaning.

> 
> 3. By 'radical'. Each character is deemed to have a subcomponent which is its
> essential element, or radical. There are several dozen radicals. There is a
> more or less agreed set of radicals, and a more or less agreed order of
> radicals. There are various schemes for determining what is the radical for a
> given character. Within a radical, characters are usually ordered in the
> dictionary by number of pen strokes.
> 
> Looking up a word in a character dictionary can be a major exercise. As an
> example, a Chinese friend offered to show me how it is done. We took the
> example of 'dragon'. It took us several hours to find this in the dictionary.

Several hours is way toooo long. You must have used the wrong dictionary.
If you know how to pronounce the character, it should not take more than a
minute to look it up. If you know the glyph of the character, you can count
its strokes. The character 'dragon' has 16 strokes and my dictionary lists
about 600 characters that consist of 16 strokes, but it should not take more
than a few minutes to identify the right one. If you know neither the sound
nor the glyph, then you need an English->Japanese/Chinese dictionary.

> The reason was that most characters have a radical which is quite small and
> simple. However 'dragon' has a very large radical consisting of the whole
> character 'dragon'. It is the only character that has dragon as its radical.

My dictionary is rather small and old, but it lists 3 characters under
'dragon radical'.

-- 
This message was not sent to you unsolicited.

From: Dorai Sitaram
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Mon, 02 Sep 2002 15:52:32 +0000
Message-ID: <al01g0$ggn$1@news.gte.com>

In article <·················@melbpc.org.au>,
Tim Josling  <···@melbpc.org.au> wrote:
>I am learning Japanese and I want to make a program to help look up the Kanji
>characters and words. 
>...
>
>Ideally also I would like to hear if anyone else has done a similar thing
>already, and if so where I could download it. 
>
>Any other ideas or suggestions would be most welcome. I plan to start writing
>this in a couple of weeks.

Do you already know of JWPce (Glenn Rosenthal,
http://www.physics.ucla.edu/~grosenth/japanese.html)?
It is described as a free (GNU) Japanese word
processor, but also includes Jim Breen's online
dictionary with a wealth of lookup mechanisms.  The
only word-processing I've used it for was to feed in
roumaji and retrieve the kanji representation and the
meaning (in Eigo), and it works amazingly well in that
regard.  While I have only scratched the surface of
this program, its look-and-feel suggests loving and
careful design.

It is not (in) Lisp though.  Rosenthal distributes
various Windows executables, but he also provides C++
source that he says will run on Wine on Linux.  (I
haven't developed the intestinal fortitude for Wine
yet, so I haven't tried that version.)

From: Tim Josling
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Thu, 05 Sep 2002 12:53:18 +0000
Message-ID: <3D7753BE.E076FEED@melbpc.org.au>

Dorai Sitaram wrote:
> 
> In article <·················@melbpc.org.au>,
> Tim Josling  <···@melbpc.org.au> wrote:
> >I am learning Japanese and I want to make a program to help look up the Kanji
> >characters and words.
> >...
> >
> >Ideally also I would like to hear if anyone else has done a similar thing
> >already, and if so where I could download it.
> >
> >Any other ideas or suggestions would be most welcome. I plan to start writing
> >this in a couple of weeks.
> 
> Do you already know of JWPce (Glenn Rosenthal,
> http://www.physics.ucla.edu/~grosenth/japanese.html)?
> It is described as a free (GNU) Japanese word
> processor, but also includes Jim Breen's online
> dictionary with a wealth of lookup mechanisms.  The
> only word-processing I've used it for was to feed in
> roumaji and retrieve the kanji representation and the
> meaning (in Eigo), and it works amazingly well in that
> regard.  While I have only scratched the surface of
> this program, its look-and-feel suggests loving and
> careful design.
> 
> It is not (in) Lisp though.  Rosenthal distributes
> various Windows executables, but he also provides  C++
> source that he says will run on Wine on Linux.  (I
> haven't developed the intestinal fortitude for Wine
> yet, so I haven't tried that version.)

Thanks. I will have a look at it.

I have found a version of CMUCL that supports Unicode on an experimental
basis. Also I eventually found Jim Breen's Kanji and Japanese repositories,
which have been used to build several lookup programs, even including one that
runs on a palm pilot. Few are written in lisp (one prototype was, which I will
follow up), but they are good enough e.g. JavaDict. Looking up characters is
far far faster than with a dictionary.

The main thing is the data is there for all to use.

A couple of people corrected various aspects of my knowledge of Japanese. For
example, the character components do contain some hints on pronunciation and
meaning. But I think that is more use to an expert.

Well, I am a beginner ;-). But finding all this will speed up my learning by
orders of magnitude.

Someone commented that I should not have taken so long to find 'dragon'. We
actually found it fast using the phonetics of course, but our objective was to
find it purely as a character. My friend was trying to demonstrate how to look
up characters that are new to you. I agree 4 hours is extreme, but 20 minutes
is not unusual. As the JavaDict page notes "At this stage most people give up
in despair".

But the problem is now solved. In truth the actual computations are quite
simple and do not require lisp. JavaDict is only 4,000 lines of Java. I am
going to do some genetic programming in finance which is probably a better fit
for lisp.

Thanks for all the helpful replies.

Tim Josling

From: Jesper Harder
Subject: Re: Japanese Character Lookup / Unicode / GUI (longish)
Date: Mon, 02 Sep 2002 15:43:12 +0000
Message-ID: <m3wuq4fldb.fsf@defun.localdomain>

Tim Josling <···@melbpc.org.au> writes:

> I am learning Japanese and I want to make a program to help look up
> the Kanji characters and words.  At the moment I am leaning to emacs
> lisp.
>
> Ideally also I would like to hear if anyone else has done a similar
> thing already, and if so where I could download it.

You might want to look at 'kdic.el':

    <http://www.emacswiki.org/cgi-bin/wiki.pl?KanjiDictionary>