From: Alex Mizrahi
Subject: how would you write this?? (clustering algorithm)
Date: 
Message-ID: <463cf2c9$0$90265$14726298@news.sunsite.dk>
Hello, All!

i'm trying to implement some clustering algorithm that works somewhat 
similar to k-means algorithm. there are abstract object, a distance function 
and a function that finds center of some list of objects (cluster). 
algorithm has to find clusters of minimal radius for these.

i've already implemented k-means algorithm, and implementation is really 
simple.

however, k-means does not satisfy me, because i have to pass it K, and 
results are barely predictable.

so, i've designed an algorithm that tries to find minimal number of clusters 
having radius not more than max-radius parameter.
also i'd like to speedup thing -- when i have some 200 clusters and 2000 
objects, k-means passes take significant amount of time, doing 400000 
comparisons.

so, my algorithm is something like this:

1. find cluster with biggest radius
2. split it into two, doing kinda k-means
3. for each object in other clusters, move it into one of new clusters if 
it's closer to new cluster's center than to other cluster's center.
4. for each object in the new cluster and for each of other clusters, move 
it into other cluster if it's closer to it's center

this algorithm avoids comparing all objects with all clusters, instead it 
works only for some 'active' clusters and objects, that are likely to be 
changed.

unfortunately, implementation of this algorithm turned into nightmare :(. 
i'm using naive lists to represent clusters, and i defined algorithm in very 
imperative/mutable way; that made it weird as hell.. you can find piece of 
this miserable failure at the bottom of this message.

so, i need to redesign my implementation. i see two possibilities:
 1) purely-functional (i think it says for itself)
 2) mutable-with-advanced-structures
    idea is to have more complex structures than flat lists to make it easy 
to move objects from one cluster to another. some hash tables, etc..

i'm going to try second way first, because i have more experience as 
imperative programmer, and it looks natural for this algorithm, but maybe 
some other suggestions?

(loop with dc1 = (dist-sorted-cluster c1)
      with dc2 = (dist-sorted-cluster c2)
      for dc in dclusters
      for sddc = (strip-dists dc)
      for dc-center = (center sddc)
      for objects = nil
      do (destructuring-bind (mvd new-dc1)
      (move-objects dc-center dc1)
    (setf dc1 new-dc1)
    (setf objects mvd))
      do (destructuring-bind (mvd new-dc2)
      (move-objects dc-center dc2)
    (setf dc1 new-dc2)
    (setf objects (nconc objects mvd)))
     collect (nconc objects sddc))


With best regards, Alex Mizrahi. 

From: Raffael Cavallaro
Subject: Re: how would you write this?? (clustering algorithm)
Date: 
Message-ID: <2007050517265043658-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2007-05-05 17:10:32 -0400, "Alex Mizrahi" 
<········@users.sourceforge.net> said:

> however, k-means does not satisfy me, because i have to pass it K, and
> results are barely predictable.

<http://en.wikipedia.org/wiki/Data_clustering>
From: Alex Mizrahi
Subject: Re: how would you write this?? (clustering algorithm)
Date: 
Message-ID: <463cfb34$0$90263$14726298@news.sunsite.dk>
(message (Hello 'Raffael)
(you :wrote  :on '(Sat, 5 May 2007 17:26:50 -0400))
(

 ??>> however, k-means does not satisfy me, because i have to pass it K, and
 ??>> results are barely predictable.

 RC> <http://en.wikipedia.org/wiki/Data_clustering>

i've read this article several times. do you know algorithm that will find 
optimal K for given max radius, and will work faster than k-means?
this article mentions QT algorithm -- that will do what i want, but it would 
be terribly slow, since it does each-to-each comparison many times.
agglomerative hierarchical clustering won't work for same reason -- it 
involves each-to-each comparisons.

some partitional hierarchical clustering won't work because i have pretty 
nasty objects -- 400-dimensional vectors with distance defined as angle 
between them. this space looks very different from our usual 2 and 3 
dimensional spaces -- most vectors are almost orthogonal.
i've already implemented two kinds of trees that did kinda hierarchical 
clustering, and that didn't work well -- it identified some clusters, but 
their radiuses were far from optimal, that's not acceptable for me.

i've analyzed data i'm working with for several months, and i came to 
conclusion that some variation of k-means algorithm should work best with 
it. but primitive k-means does not fully satisfy me, so i need a variation 
that is better for my purposes.

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"I am everything you want and I am everything you need") 
From: Tamas Papp
Subject: Re: how would you write this?? (clustering algorithm)
Date: 
Message-ID: <87vef6vlqb.fsf@pu100877.student.princeton.edu>
"Alex Mizrahi" <········@users.sourceforge.net> writes:

> i've analyzed data i'm working with for several months, and i came to 
> conclusion that some variation of k-means algorithm should work best with 
> it. but primitive k-means does not fully satisfy me, so i need a variation 
> that is better for my purposes.

Hi Alex,

I don't know if you use Bayesian methods, if so, maybe it would be
better to do this using MCMC.  For example, Google turned up this:

www.people.fas.harvard.edu/~junliu/TechRept/02folder/bcvs.pdf

You might need some matrix algebra.  Currently your best bet seems to
be interfacing to Atlas/Lapack.

Best,

Tamas

-- 
Posted via a free Usenet account from http://www.teranews.com
From: Alex Mizrahi
Subject: Re: how would you write this?? (clustering algorithm)
Date: 
Message-ID: <463cfd41$0$90267$14726298@news.sunsite.dk>
(message (Hello 'Alex)
(you :wrote :to '(All) :on '(Sun, 6 May 2007 00:10:32 +0300))
(

 AM> so, i need to redesign my implementation. i see two possibilities:
 AM>  1) purely-functional (i think it says for itself)
 AM>  2) mutable-with-advanced-structures

i now think i can simplify my algorithm and define all actions as splits and 
re-balancing of two clusters. this way algorithm might be closer to 
'pure-functional' approach..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"I am everything you want and I am everything you need")