Hello, All!
I'm trying to implement a clustering algorithm that works somewhat like
k-means. There are abstract objects, a distance function, and a function
that finds the center of a list of objects (a cluster). The algorithm has
to find clusters of minimal radius for these.
I've already implemented k-means, and the implementation is really simple.
However, k-means does not satisfy me, because I have to pass it K, and the
results are barely predictable.
So I've designed an algorithm that tries to find the minimal number of
clusters whose radius is no more than a max-radius parameter.
I'd also like to speed things up: when I have some 200 clusters and 2000
objects, each k-means pass takes a significant amount of time, doing
400,000 comparisons.
So my algorithm is something like this:
1. Find the cluster with the biggest radius.
2. Split it into two, doing a kind of k-means pass.
3. For each object in the other clusters, move it into one of the new
clusters if it's closer to that new cluster's center than to its current
cluster's center.
4. For each object in the new clusters, move it into one of the other
clusters if it's closer to that other cluster's center.
This algorithm avoids comparing all objects with all clusters; instead it
works only on a few 'active' clusters and the objects that are likely to
change.
Unfortunately, the implementation of this algorithm turned into a
nightmare :(. I'm using naive lists to represent clusters, and I defined
the algorithm in a very imperative/mutable way, which made it weird as
hell. You can find a piece of this miserable failure at the bottom of this
message.
So I need to redesign my implementation. I see two possibilities:
1) purely functional (I think that speaks for itself)
2) mutable with advanced structures
The idea is to use structures more complex than flat lists, to make it easy
to move objects from one cluster to another: hash tables, etc.
I'm going to try the second way first, because I have more experience as an
imperative programmer, and it looks natural for this algorithm. But maybe
there are other suggestions?
;; Re-balance: for each remaining cluster DC, pull out of the two new
;; clusters DC1/DC2 any objects that are closer to DC's center, and
;; collect them into DC.
(loop with dc1 = (dist-sorted-cluster c1)
      with dc2 = (dist-sorted-cluster c2)
      for dc in dclusters
      for sddc = (strip-dists dc)
      for dc-center = (center sddc)
      for objects = nil
      do (destructuring-bind (mvd new-dc1)
             (move-objects dc-center dc1)
           (setf dc1 new-dc1)
           (setf objects mvd))
      do (destructuring-bind (mvd new-dc2)
             (move-objects dc-center dc2)
           (setf dc2 new-dc2)
           (setf objects (nconc objects mvd)))
      collect (nconc objects sddc))
With best regards, Alex Mizrahi.
On 2007-05-05 17:10:32 -0400, "Alex Mizrahi"
<········@users.sourceforge.net> said:
> However, k-means does not satisfy me, because I have to pass it K, and
> the results are barely predictable.
<http://en.wikipedia.org/wiki/Data_clustering>
(message (Hello 'Raffael)
(you :wrote :on '(Sat, 5 May 2007 17:26:50 -0400))
(
??>> However, k-means does not satisfy me, because I have to pass it K,
??>> and the results are barely predictable.
RC> <http://en.wikipedia.org/wiki/Data_clustering>
I've read that article several times. Do you know an algorithm that will
find the optimal K for a given max radius and work faster than k-means?
The article mentions the QT algorithm; that would do what I want, but it
would be terribly slow, since it does each-to-each comparisons many times.
Agglomerative hierarchical clustering won't work for the same reason: it
involves each-to-each comparisons.
Partitional hierarchical clustering won't work because I have pretty nasty
objects: 400-dimensional vectors, with distance defined as the angle
between them. That space looks very different from our usual 2- and
3-dimensional spaces; most vectors are almost orthogonal.
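For concreteness, the distance I mean is just the angle between the two
vectors, computed via the normalized dot product (assuming here that the
vectors are plain Lisp vectors of floats; this is a sketch, not my tuned
code):

```lisp
;; Angular distance between two vectors U and V: the angle between
;; them, via the normalized dot product.  The result is clamped to
;; [-1, 1] before ACOS to avoid domain errors from float rounding.
(defun angular-distance (u v)
  (let ((dot 0.0) (nu 0.0) (nv 0.0))
    (loop for x across u
          for y across v
          do (incf dot (* x y))
             (incf nu (* x x))
             (incf nv (* y y)))
    (acos (max -1.0 (min 1.0 (/ dot (sqrt (* nu nv))))))))
```

Nearly orthogonal vectors all end up with distances close to pi/2, which
is why the space behaves so differently from low-dimensional ones.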
I've already implemented two kinds of trees that did a kind of
hierarchical clustering, and that didn't work well: they identified some
clusters, but their radii were far from optimal, which is not acceptable
for me.
I've analyzed the data I'm working with for several months, and I came to
the conclusion that some variation of the k-means algorithm should work
best with it. But primitive k-means does not fully satisfy me, so I need a
variation that is better for my purposes.
)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"I am everything you want and I am everything you need")
"Alex Mizrahi" <········@users.sourceforge.net> writes:
> I've analyzed the data I'm working with for several months, and I came
> to the conclusion that some variation of the k-means algorithm should
> work best with it. But primitive k-means does not fully satisfy me, so
> I need a variation that is better for my purposes.
Hi Alex,
I don't know whether you use Bayesian methods; if so, it might be better
to do this using MCMC. For example, Google turned up this:
www.people.fas.harvard.edu/~junliu/TechRept/02folder/bcvs.pdf
You might need some matrix algebra; currently your best bet seems to be
interfacing to ATLAS/LAPACK.
Best,
Tamas
--
Posted via a free Usenet account from http://www.teranews.com
(message (Hello 'Alex)
(you :wrote :to '(All) :on '(Sun, 6 May 2007 00:10:32 +0300))
(
AM> So I need to redesign my implementation. I see two possibilities:
AM> 1) purely functional (I think that speaks for itself)
AM> 2) mutable with advanced structures
I now think I can simplify my algorithm and define all actions as splits
and re-balancings of two clusters. That way the algorithm might be closer
to the 'purely functional' approach.
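For instance, the re-balancing of a pair of clusters could be a pure
function from two clusters to two new clusters (CENTER and DIST being the
abstract helpers from my first post; just a sketch):

```lisp
;; Purely functional re-balance of two clusters: return two fresh
;; lists in which every object sits with the nearer of the two
;; current centers.  No input list is mutated.
(defun rebalance (cluster-a cluster-b)
  (let ((ca (center cluster-a))
        (cb (center cluster-b)))
    (loop for obj in (append cluster-a cluster-b)
          if (<= (dist obj ca) (dist obj cb))
            collect obj into new-a
          else
            collect obj into new-b
          finally (return (values new-a new-b)))))
```

The whole algorithm would then be repeated applications of SPLIT and
REBALANCE until no object moves.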
)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"I am everything you want and I am everything you need")