From: NickName
Subject: ID3 question
Date: 
Message-ID: <1104531988.681833.263850@z14g2000cwz.googlegroups.com>
Happy New Year to you all!

First of all, this is not a LISP question, but since LISP is probably
a language for AI sorts of things, I figured posting here would still
be meaningful. By the way, a more appropriate NG refreshes only weekly,
and I have not gotten an answer there.  Bear with me.

I have a question regarding entropy calculation for a decision tree
using the ID3 algorithm.


To refresh your memory, the sample data (training data) looks like
this:

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No


I believe I have a rudimentary understanding of the information
theory involved:

entropy(S) = -(p1*log2(p1) + ... + pn*log2(pn))
Gain(S, A) = entropy(S) - sum over each value v of A of
             (|Sv|/|S|) * entropy(Sv)
(I can't display the formulas properly here; the above is a plain-text
rendering.)
No problem with the first "pass" for entropies (and hence gains), such
as
HUMIDITYhigh   = - (3/7) * log2 (3/7) - (4/7) * log2 (4/7) = 0.985
HUMIDITYnormal = - (6/7) * log2 (6/7) - (1/7) * log2 (1/7) = 0.592
WindWeak:   0.811
WindStrong: 1
...
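
In case anyone wants to check these numbers, here is the little Common
Lisp helper I have been using (a minimal sketch of the two formulas
above; the code and names are my own, and a distribution is passed in
as a list of per-class counts):

(defun entropy (counts)
  "Shannon entropy (base 2) of a list of per-class counts."
  (let ((total (reduce #'+ counts)))
    (loop for c in counts
          when (plusp c)              ; a 0*log2(0) term is taken as 0
            sum (let ((p (/ c total)))
                  (- (* p (log p 2)))))))

(defun info-gain (parent-counts child-counts-list)
  "Gain = entropy(parent) minus the weighted sum of child entropies."
  (let ((n (reduce #'+ parent-counts)))
    (- (entropy parent-counts)
       (loop for child in child-counts-list
             sum (* (/ (reduce #'+ child) n)
                    (entropy child))))))

(entropy '(3 4))   ; => 0.985  (HUMIDITYhigh: 3 yes / 4 no)
(entropy '(6 1))   ; => 0.592  (HUMIDITYnormal: 6 yes / 1 no)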


And Gain4Outlook = 0.246 (the highest among the four attributes).


So we pick OUTLOOK as the first attribute.  OUTLOOK has 3 values,
Sunny, Overcast and Rain; let's start with Sunny.



Entropy for [D1, D2, D8, D9, D11] = entropySet4Sunny = 0.970 (same as
the lecture material).
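
With the same helpers as above (the whole set is 9 yes / 5 no, and
OUTLOOK splits it into Sunny 2+/3-, Overcast 4+/0-, Rain 3+/2-):

(info-gain '(9 5) '((2 3) (4 0) (3 2)))  ; => 0.246  (Gain4Outlook)
(entropy '(2 3))                         ; => 0.970  (entropySet4Sunny)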

However, at the second "pass" the entropy calculation threw me off, in
the sense that my result differs from the ID3 lectures of several
different institutions (their entropies for the second "pass" all
agree, so I must be the one who's wrong).  So the question is: what
went wrong?


Here's my calculation,
HUMIDITYhigh2   = - (0/5) * log2 (0/5) - (3/5) * log2 (3/5)
                = - 0 - (3/5) * log2 (3/5)
                = 0.442
HUMIDITYnormal2 = - (2/5) * log2 (2/5) - (0/5) * log2 (0/5)
                = - (2/5) * log2 (2/5) - 0
                = 0.528


BUT "lecture material" reads as
Gain(Ssunny , Humidity)=0.970-(3/5)0.0 - 2/5(0.0) = 0.970
implying that HUMIDITYhigh2 is 0 and HUMIDITYnormal2 is 0 as well.
How did they do their entropy calculation here or what did I do wrong?


Or is there a rule along the lines of "if one 'sub-element' is zero
then that term of the entropy is zero", as in the first "pass":
OUTLOOKovercast = - (4/4) * log2 (4/4) - 0
                = 0 (the second "sub-element" is zero)
?


Another question,
the tree looks like this:

                OUTLOOK
              /    |    \
         Sunny  Overcast  Rain
          /    Yes (stop)    \
     HUMIDITY               WIND
       ...                   ...


For the OUTLOOK --> Overcast branch, does it stop there because the
split is (4+, 0-), or because the entropy at that point is 0, or for
some other reason?
I'm just getting started.

Another question (maybe I should hold it for now): what might be a
good ratio/percentage of the "training data set" to the actual data set?
TIA.


D

From: lin8080
Subject: Re: ID3 question
Date: 
Message-ID: <41D889DD.6D71D7EA@freenet.de>
NickName wrote:

> First of all, this is not a LISP question, ...

Just in case you get no answer, here is one:

Google (the search engine) has a new service for searching scientific
literature.  Give it a try.
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1104641551.097657.102870@c13g2000cwb.googlegroups.com>
Thanks for the response and pointer; you were probably referring to
Google Scholar.  It did find some articles of great interest; however,
each asks for money up front (not that they're not worth it), and I'm
not sure they would answer my question.  Tomorrow I'm going to re-read
the terse "lecture slide" to see if I missed something last time.

Don
From: lin8080
Subject: Re: ID3 question
Date: 
Message-ID: <41D9ECD7.8897B963@freenet.de>
NickName wrote:

> Thanks for the response and pointer; you were probably referring to
> Google Scholar.  It did find some articles of great interest; however,
> each asks for money up front (not that they're not worth it), and I'm
> not sure they would answer my question.  Tomorrow I'm going to re-read
> the terse "lecture slide" to see if I missed something last time.

Uuuh  :(

Now I have found another post from you in comp.ai, also unanswered.
Meanwhile, there is something else I found in comp.ai.  I assume you
look there, but here is a copy&paste anyway.

---------copy&paste-------------

A while ago I made a list:
http://groups-beta.google.com/group/sci.stat.consult/msg/b5069fdbcc9d3512

mag. Aleks Jakulin
http://www.ailab.si/aleks/
---------next-----------------

Check out WEKA: <http://www.cs.waikato.ac.nz/ml/weka/>.  It's Java and
includes an implementation of ID3, although if you're doing something
serious, you should use WEKA's J48, a Java implementation of C4.5,
which is a modern version of ID3.

Mark
---------next (response)---------

[...]
I then had a look at the code written by Quinlan: the latest version I
could  find was C4.5r8.  After some minor changes to the source it
compiled fine on Cygwin and it has no trouble with the large datasets.

Willem
-------------end----------------

Hope this helps.  The ID3 algorithm is not known to me.
There is a large AI site on the web:
http://aima.cs.berkeley.edu/ai.html

lin
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1104807612.545415.234480@f14g2000cwb.googlegroups.com>
Thank you very much for the link, very interesting.  I understood ID3
was probably the first-generation algorithm for decision trees, so
Weka's J48 or the like would likely perform better; however, my goal
was to learn the essentials/fundamentals of machine learning, so my
approach was like learning to walk before attempting to run.

Aside from time limitations and a less-than-ideal time frame for
learning, so far my approach to learning seems productive.  About
C4.5: could you post its algorithm?  I couldn't find it, or haven't
tried hard enough.
TIA.

Don
From: lin8080
Subject: Re: ID3 question
Date: 
Message-ID: <41DB7FF3.4450B164@freenet.de>
NickName wrote:

> Aside from time limitations and a less-than-ideal time frame for
> learning, so far my approach to learning seems productive.  About
> C4.5: could you post its algorithm?  I couldn't find it, or haven't
> tried hard enough.

No, because I have given up on binary systems (too slow, and the
optimization rate is over 75%); instead I am experimenting with cube
address space and three-bit systems accessed via complex-number rules.
It is hard enough to find any text about this; only chatrooms have
helped me so far.  Good to have Lisp here.

lin
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1105397052.204124.98850@c13g2000cwb.googlegroups.com>
Sorry for the late response.  Sounds like you're way ahead of me in
this field.  Do you think there's some sort of natural relationship
between AI and data visualization, or none at all?

Don
From: lin8080
Subject: Re: ID3 question
Date: 
Message-ID: <41E59A9F.F5CC7FB2@freenet.de>
NickName wrote:

> Sorry for the late response.  Sounds like you're way ahead of me in
> this field.  Do you think there's some sort of natural relationship
> between AI and data visualization, or none at all?

Oh, and in English at that.  Well, a try:

Your question reads differently for human brains and for running
computers, which are more or less a sequential copy of the former.  So
for human brains the answer is yes; for computers it is maybe.
"Completely not" depends on how the data comes into the processing
system, and it is usually nothing picture-like.

The point is the way human brains do data processing.  It is a highly
visual event.  The brain's internals perform massive abstraction on
it, and as I have read, some of the intelligence happens during this
process.  Most of it surely comes after that, but it is based on the
data accessible in memory.

At the moment I am looking at how some "picture-analytic routines" are
stored in computer memory, and it seems impossible to have any AI
program handle these datasets without knowledge of their origin;
otherwise there is no comparable event for even simple conclusions,
and it computes nonsense.

(There are some interesting routines/papers out on the net in which
one picture pixel is centered; these routines can be applied to other
grid values with only small modifications necessary, like one tool for
all.)

In places it is too mathematical for me, but sooner or later you have
a grid of values and work on that, and a picture in a computer is a
grid of values too.  Object-describing data is not very useful here;
it takes too much time and is not blurred.  And here the gene bits
come in (I still have to search for new papers on this), but you have
only 2 gene values while nature has 4 (which seems to be an optimum?).
But that is only my opinion.

stefan

have a glass of worms :)
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1106620563.740482.175910@c13g2000cwb.googlegroups.com>
Sorry for not following up in time (maybe the glass of 'worms' choked
me :)
Anyway, a question about discretization for you, if you don't mind.

Quinlan's technique of finding candidate thresholds at class
boundaries would work for the following set (attribute):
Temperature (A/F)  // PlayTennis (C):
40 48 60  72  80  90
No No Yes Yes Yes No

Candidate 1 (the 48/60 pair),
Candidate 2 (the 80/90 pair),
and via info gain we know candidate 1 is the better choice, so the
threshold is Temperature > 54 (in this case, one misclassification,
namely the value 90, correct?  Note: this is a minor question).
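
In code form, here is how I understand the boundary trick (a sketch
with my own names; the values must be sorted and paired with their
class labels):

(defun candidate-thresholds (values classes)
  "Midpoints between adjacent attribute values whose class labels
differ.  VALUES must be sorted ascending; CLASSES are the matching
labels.  Only these midpoints need to be scored by info gain."
  (loop for (a b) on values
        for (ca cb) on classes
        while b
        unless (eql ca cb)
          collect (/ (+ a b) 2)))

(candidate-thresholds '(40 48 60 72 80 90)
                      '(no no yes yes yes no))
; => (54 85), i.e. the 48/60 boundary and the 80/90 boundary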

However, it seems to me the above technique would not be applicable to
the following set (attribute) of continuous data:
salary(A/F) class (C)
--------------- -----
10000 0
10000 1
100000 0
100000 1
110000 0
110000 1
120000 0
120000 1
130000 0
130000 1
160000 0
170000 1
20000 0
20000 1
30000 0
30000 1
40000 0
40000 1
50000 0
50000 1
60000 0
60000 1
70000 0
70000 1
80000 0
80000 1
90000 0
90000 1

So, the question is, does one have to use some other technique to
handle continuous data like the above? And when may one ignore this
attribute?

TIA.
From: lin8080
Subject: Re: ID3 question
Date: 
Message-ID: <41F8CEE4.BEA430E3@freenet.de>
NickName wrote:

> Sorry for not following up in time (maybe the glass of 'worms' choked
> me :)
Don't be sorry.  Some processes need time to think/look/search around.

> Quinlan's technique 
How old is this method?

> Candidate 1 (the 48/60 pair),
> Candidate 2 (the 80/90 pair),
> and via info gain we know candidate 1 is the better choice, so the
> threshold is Temperature > 54 (in this case, one misclassification,
> namely the value 90, correct?  Note: this is a minor question).
Well, I do not know your classification scale.  From 0 to 100?  It
looks like some statistics material.

> However, it seems to me the above technique would not be applicable
> to the following set (attribute) of continuous data:
> salary(A/F) class (C)
> --------------- -----
> 10000 0
> 10000 1
> 100000 0
> 100000 1
> [...]
Then you have to find a better one.  Please do yourself a favor: don't
look at how others do it.  Do it your own way, and after some
improvements you will see something new.  And then you will have lots
of fun working in this field (as I have).
Often such things are done by students, and they fiddle around until
they have what a professor wants to see (very bad).

Now, looking at your table above, I see "class" is 0 or 1.  I do not
know how those values are used later on, so you have to check.  For
example, do you need all the zeroes, or can you drop that "class" so
your table becomes smaller?  In case a "salary" is not found with
value 1 (one), it is assumed to be 0 by default.  This can speed up
the processing.

> So, the question is, does one have to use some other technique to
> handle continuous data like the above? And when may one ignore this
> attribute?

That is not so easy to answer.  You have to look at what you really
need from a data set; there is often a lot of optimization possible,
somewhat like the function in a KV (Karnaugh-Veitch) diagram.
If you find nothing by yourself, you have to use what others have
found, but that will not always fit your situation, and in most cases
you need time to integrate foreign methods into your process.  In
other cases your data set only needs rearranging to serve foreign
functions well.  For that you need some knowledge of what is
available; I think this is not the best strategy here, but it will do
the job.

lin

will someone play tennis under a rainbow?
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1107128271.683397.86370@f14g2000cwb.googlegroups.com>
lin8080 wrote:
> That is not so easy to answer.  You have to look at what you really
> need from a data set; there is often a lot of optimization possible.
> [...]
> For that you need some knowledge of what is available; I think this
> is not the best strategy here, but it will do the job.
>
> lin
Sorry for the late response.  First, good news: I stayed up late last
night (till 3:38 AM) working on a classification algorithm of my own
(with influence from ML concepts/techniques, etc.), using an MS
dataset (2054 rows in total, 7 continuous attributes and one
categorical/nominal attribute; by the way, I discarded quite a few
columns that I determined do not contribute to the class).  I divided
it into 70% for training and the remaining 30% for test, and the
first-run accuracy was incredible (I may post the exact percentage
later on; I'd like to check with some MS engineers to see how well MS
can do first :)
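
The 70/30 split itself was nothing fancy, just a uniform shuffle and a
cut, roughly like this sketch (the names are my own):

(defun shuffle (list)
  "Fisher-Yates shuffle; returns a fresh list."
  (let ((v (coerce list 'vector)))
    (loop for i from (1- (length v)) downto 1
          do (rotatef (aref v i) (aref v (random (1+ i)))))
    (coerce v 'list)))

(defun split-train/test (rows &optional (train-fraction 7/10))
  "Returns two values: the training rows and the test rows."
  (let* ((shuffled (shuffle rows))
         (n (floor (* train-fraction (length rows)))))
    (values (subseq shuffled 0 n) (subseq shuffled n))))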

Also, I fully agree that, philosophically, one needs to learn from
others and, equally important, not be confined by that whenever
possible.

Here, I'd also like to mention that I've read quite a few papers on
ML, and I have to say I really appreciate Quinlan's clarity in getting
his point(s) across: superb thought formation and delivery, in
addition to great intellect.

Regards,

Don
From: NickName
Subject: Re: ID3 question
Date: 
Message-ID: <1104704104.673750.54750@c13g2000cwb.googlegroups.com>
Here's an update.  OK, I figured them out myself.

The "lecture slide" has the following for the second "PASS"
Gain(Ssunny , Humidity)=0.970-(3/5)0.0 - 2/5(0.0) = 0.970
Gain(Ssunny , Temp.)=0.970-(2/5)0.0 -2/5(1.0)-(1/5)0.0 = 0.570
Gain(Ssunny , Wind)=0.970= -(2/5)1.0 - 3/5(0.918) = 0.019

Mine is
Gain4Humidity2: 0.970
Gain4TEMP2: 0.599
Gain4Wind2: 0.019

Of those, my TEMP2 calculation was as follows:
-- Sunny / Temp: 2 Mild (1 yes / 1 no), 2 Hot (2 no) and 1 Cool (1 yes)
TEMPmild2 = - (1/5) * log2 (1/5) - (1/5) * log2 (1/5) = 0.928

Hmm.  Of the above, for the 2 Mild, 1 is yes and 1 is no, so it's
totally random, hence the entropy is 1 (so forget the formula above;
the proportions have to be taken within the 2-example subset, i.e. 1/2
and 1/2, not out of all 5);
for the 2 Hot (both no), "perfectly classified", so the entropy is 0;
for the 1 Cool (1 yes), also "perfectly classified", so again the
entropy is 0.
Is this analysis/understanding correct?
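
For the record, feeding those counts through the little helpers from
my other post reproduces the lecture's numbers (the counts are
(yes no) within the Sunny subset; the small differences come from the
lecture rounding the parent entropy to 0.970 first):

(info-gain '(2 3) '((1 1) (0 2) (1 0)))  ; Temp     => 0.571 (~0.570)
(info-gain '(2 3) '((0 3) (2 0)))        ; Humidity => 0.971 (~0.970)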

Also, I figured out when/how a branch/node stops.  Seems simple: if a
node is classifiable (pure), stop; otherwise continue/iterate.
(Comment: the process actually requires querying all the sample data
for the node's value and its corresponding class value; a total
beginner may not think of that at the onset, which is why I asked
previously.)
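
In code the stopping test is just a purity check on the node's class
labels, e.g.:

(defun pure-p (class-labels)
  "True when every example at the node has the same class, i.e. the
node's entropy is 0 and it becomes a leaf."
  (null (cdr (remove-duplicates class-labels))))

(pure-p '(yes yes yes yes))  ; => T    (the Overcast branch: 4+, 0-)
(pure-p '(no no yes))        ; => NIL  (keep splitting)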

Now, I hope you don't mind answering the following questions:
1) How should one select the sample data, and at what percentage?
Say we have 5000 rows of actual data: a) a certain top percentage,
plus middle and bottom?  b) totally random selection?  c) something
else?
2) What is "test data" about (in relation to "sample data")?

TIA.