From: Rmagere
Subject: K-Means Algorithm
Date: 
Message-ID: <c1oiba$cfk$1@news.ox.ac.uk>
Hi,
I have written a naive implementation of the k-means algorithm - ideally it
should be general enough that given appropriate distance, mean, normalize
and denormalize functions it can cover any discourse domain.
I was wondering if I could get some feedback on how to improve on it, i.e.
if it lacks generality and or how to improve its efficiency, not that I need
it for what I am doing is more as chance to improve my coding skills.

Thanks

;;;; -*- mode: Lisp; Syntax: ACL 5.0.1; Package: k-mean, common-lisp -*-
;;;
****************************************************************************
*****
;;; FILE IDENTIFICATION
;;;
;;; Name:     K-Means Algorithm
;;; Project:  k-means
;;; Purpose:  To provide a naive implementation of the K-Means Algorithm
;;; Depends:  none
;;; Author:   Marco Rigolli
;;; Version:  0.1.03 - 27 Feb 2004
;;;
****************************************************************************
*****

(in-package :cl)

(defpackage :k-mean
  (:nicknames :km)
  (:use :cl)
  (:export #:k-mean #:interset-function #:intraset-function
           #:2d-points-euclidian-distance #:2d-points-mean
#:2d-points-normalize #:2d-points-denormalize))

(in-package #:km)

(defun most (fn list)
  "Utility function taken from pag.52 of `On Lisp' by Paul Graham"
  (if (null list)
      (values nil nil)
    (let* ((wins (first list))
           (max (funcall fn wins)))
      (dolist (obj (rest list))
        (let ((score (funcall fn obj)))
          (when (> score max)
            (setf wins obj)
            (setf max score))))
      (values wins max))))

(defclass cluster-set ()
  ((name :accessor name :initarg :name :initform (gensym "cluster-"))
   (centre :accessor centre :initarg :centre :initform nil)
   (data :accessor data :initarg :data :initform nil))
  (:documentation "Class to store clustered data"))

(defmethod print-object ((x cluster-set) stream)
  (format stream "#<~a: ~a centred at ~a with ~a elements" (type-of x) (name
x) (centre x) (length (data x))))

(defun find-closest-centre (distance-fn centres data-point)
  "Given a distance function, a set of centres and a single data-point it
returns the centre closest to the data-point"
  (most #'(lambda(x)(- (funcall distance-fn data-point x))) centres))

(defun cluster-data (distance-fn centres data-points)
  "Given a distance function, a set of centres and a set of data-points it
returns a list of clustered data-points"
  (loop for centre in centres
      collecting (loop for data-point in data-points
                     when (equal centre (find-closest-centre distance-fn
centres data-point))
                     collecting data-point)))

(defun clusters-mean (mean-fn clusters)
  "Given a mean function and a list of data clusteres it returns the mean of
each cluster set"
  (loop for cluster in clusters collecting (funcall mean-fn cluster)))

(defun determine-best-centres (distance-fn mean-fn centres data-points)
  "Given a distance function, a mean function, a set of starting centres and
a set of data-points it returns an optimum set of clustering centres"
  (let ((new-centres (clusters-mean mean-fn (cluster-data distance-fn
centres data-points))))
    (if (equal new-centres centres) new-centres
      (determine-best-centres distance-fn mean-fn new-centres
data-points))))

(defun initialize-data-for-k-means (normalize-fn number-of-clusters
data-points)
  "Given a normalizing function, the number of desired clusters and a set of
data points
   it returns the normalized data-points, a set of starting cluster centres
and the normalizing-point"
  (multiple-value-bind (normalized-data normalizing-point) (funcall
normalize-fn data-points)
    (values normalized-data
            (loop for i to (- number-of-clusters 1) collecting (nth i
normalized-data))
            normalizing-point)))

(defun k-mean (distance-fn mean-fn normalize-fn denormalize-fn
number-of-clusters data-points)
  "runs the k-means algorithm on a set of given data-points - after that the
appropriate functions have been also provided"
  (multiple-value-bind (normalized-data centres normalizing-factor)
(initialize-data-for-k-means normalize-fn number-of-clusters data-points)
    (let ((final-centres (determine-best-centres distance-fn mean-fn centres
normalized-data)))
      (loop for centre in final-centres
          collecting (make-instance 'cluster-set
                       :centre (first (funcall denormalize-fn
normalizing-factor (list centre)))
                       :data (funcall denormalize-fn normalizing-factor
                                      (loop for normalized-data-point in
normalized-data
                                          when (equal centre
(find-closest-centre distance-fn final-centres normalized-data-point))
                                          collecting
normalized-data-point)))))))

(defmethod intraset-distance (distance-fn (cluster cluster-set))
  "Returns the average distance of the elements of a cluster from the centre
of the cluster"
  (loop for data-point in (data cluster)
      counting data-point into data-count
      summing (funcall distance-fn data-point (centre cluster)) into
distance-sum
      finally (return (/ distance-sum data-count))))

(defun interset-distance (distance-fn mean-fn clusters)
  "Returns the average distance of the centres of a set of clusters from the
mean (mean-fn) of those centres"
  (let* ((centres-list (mapcar #'centre clusters))
         (baricentre (first (clusters-mean mean-fn (list centres-list)))))
    (loop for centre in centres-list
        counting centre into centre-count
        summing (funcall distance-fn centre baricentre) into centre-sum
        finally (return (/ centre-sum centre-count)))))


;;Functions specific to 2D points locations

(defun 2d-points-euclidian-distance (point-a point-b)
  "Calculates the euclidian distance between two 2D points"
  (sqrt (+ (expt (- (first point-a) (first point-b)) 2) (expt (- (second
point-a) (second point-b)) 2))))

(defun 2d-points-mean (points)
  "Given a set of points it returns the mean point"
  (loop for point in points
      counting point into point-count
      summing (first point) into x-sum
      summing (second point) into y-sum
      finally (return (list (/ x-sum point-count) (/ y-sum point-count)))))

(defun 2d-points-normalize (points)
  "Given a set of points it returns the points normalized and the
'normalizing point'"
  (let ((normalizing-point (list (apply #'max (mapcar #'first points))
(apply #'max (mapcar #'second points)))))
    (values (mapcar #'(lambda(point)(list (/ (first point) (first
normalizing-point)) (/ (second point) (second normalizing-point)))) points)
            normalizing-point)))

(defun 2d-points-denormalize (normalizing-point points)
  "Given a `normalizing point' and a set of normalized points it returns the
points in their original scale"
  (loop for point in points collecting (list (* (first point) (first
normalizing-point)) (* (second point) (second normalizing-point)))))

From: rif
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <wj0k72727yh.fsf@five-percent-nation.mit.edu>
I have a number of comments/suggestions.  People who are more
experienced in CL should feel free to correct me.  I hope this doesn't
come off too harshly, as it is meant to be helpful.  These are all
just opinions, and some of my ideas may not make sense in your
application.

1.  I find that the normalizing/denormalizing adds to the complexity
    of the code without really adding to the power.  If you want to
    normalize and denormalize, just normalize before you pass the data
    in, and denormalize after you get it back --- in other words, do
    it outside the context of the kmeans algorithm.  It just seems
    awkward to pass in a normalizing function, which you then pass to
    an initialization function, which you then simply apply wholesale
    to your data and then don't use again.

2.  I think there are many times you use loop where it would be
    simpler to use other CL primitives.  Loop is worth knowing, but
    your code is awkward because you use it when it's not appropriate.
    For instance, for each of the following pairs, I would replace
    your code (shown first) with the alternative :

    (loop for i to (- number-of-clusters 1) collecting (nth i normalized-data))
    (subseq normalize-data (- number-of-clusters 1))

    (loop for cluster in clusters collecting (funcall mean-fn cluster))
    (map #'mean-fn cluster)

    (loop for data-point in (data cluster)
      counting data-point into data-count
      summing (funcall distance-fn data-point (centre cluster)) into
distance-sum
      finally (return (/ distance-sum data-count))))

    (let ((data (data cluster))
          (centre (centre cluster)))
      (/ (reduce #'+ (map (lambda (d-p) (funcall distance-fn d-p centre)))
                          data))
         (length data))


3.  There is a double loop in cluster-data and in k-mean that is
    basically the same code twice.  You've written the same code twice
    because although you're returning a cluster-set as the final
    result, internal to the algorithm you're not using cluster sets.
    Conceptually, the thing you have at every iteration of kmeans is
    in the same form as the thing you're going to return (a bunch of
    centres, and a collection of which points are closest to those
    centers).  Take advantage of this.

4.  Be aware that the double loop referred to in (3) is not
    conceptually/algorithmically necessary.  To put n points in d
    dimensions into k clusters, it is only necessary to do O(ndk)
    work.  Your approach does O(ndk^2) work.  To get around this, get
    rid of the outer loop --- for each data point, find the closest
    center, and add that data point to the cluster for that center.
    Note that you should only worry about this if you profile your
    code and see this is the bottleneck.  An intermediate approach
    (which is what I do in my own kmeans implementation) is to cache a
    table of all the distances to all the centers (this is O(ndk)
    work) and then to do a double loop but not recompute the distances
    (an additional O(nk^2) amount of work, or O(n(d+k)k) work total.
    Your own mileage may vary, but I suspect since you're using lists
    and generic arithmetic everywhere, you've got plenty of other
    optimization issues to worry about before this one.

5.  You don't need # before a closure or before keywords.  Examples from my code:

(defpackage "KMEANS"
  (:use :common-lisp :dataset :dataset-utils :utilities :gen-opt :combinatorics)
  (:export :kmeans :log-wk))

(in-package :kmeans)

(defun-inline find-closest-index (point centers)
  (optimal-index centers (lambda (c) (l2-distance-squared point c))))

Hope this helps.

Cheers,

rif

"Rmagere" <·······@*the*mail*that*burns*.com> writes:

> Hi,
> I have written a naive implementation of the k-means algorithm - ideally it
> should be general enough that given appropriate distance, mean, normalize
> and denormalize functions it can cover any discourse domain.
> I was wondering if I could get some feedback on how to improve on it, i.e.
> if it lacks generality and or how to improve its efficiency, not that I need
> it for what I am doing is more as chance to improve my coding skills.
> 
> Thanks
> 
> ;;;; -*- mode: Lisp; Syntax: ACL 5.0.1; Package: k-mean, common-lisp -*-
> ;;;
> ****************************************************************************
> *****
> ;;; FILE IDENTIFICATION
> ;;;
> ;;; Name:     K-Means Algorithm
> ;;; Project:  k-means
> ;;; Purpose:  To provide a naive implementation of the K-Means Algorithm
> ;;; Depends:  none
> ;;; Author:   Marco Rigolli
> ;;; Version:  0.1.03 - 27 Feb 2004
> ;;;
> ****************************************************************************
> *****
> 
> (in-package :cl)
> 
> (defpackage :k-mean
>   (:nicknames :km)
>   (:use :cl)
>   (:export #:k-mean #:interset-function #:intraset-function
>            #:2d-points-euclidian-distance #:2d-points-mean
> #:2d-points-normalize #:2d-points-denormalize))
> 
> (in-package #:km)
> 
> (defun most (fn list)
>   "Utility function taken from pag.52 of `On Lisp' by Paul Graham"
>   (if (null list)
>       (values nil nil)
>     (let* ((wins (first list))
>            (max (funcall fn wins)))
>       (dolist (obj (rest list))
>         (let ((score (funcall fn obj)))
>           (when (> score max)
>             (setf wins obj)
>             (setf max score))))
>       (values wins max))))
> 
> (defclass cluster-set ()
>   ((name :accessor name :initarg :name :initform (gensym "cluster-"))
>    (centre :accessor centre :initarg :centre :initform nil)
>    (data :accessor data :initarg :data :initform nil))
>   (:documentation "Class to store clustered data"))
> 
> (defmethod print-object ((x cluster-set) stream)
>   (format stream "#<~a: ~a centred at ~a with ~a elements" (type-of x) (name
> x) (centre x) (length (data x))))
> 
> (defun find-closest-centre (distance-fn centres data-point)
>   "Given a distance function, a set of centres and a single data-point it
> returns the centre closest to the data-point"
>   (most #'(lambda(x)(- (funcall distance-fn data-point x))) centres))
> 
> (defun cluster-data (distance-fn centres data-points)
>   "Given a distance function, a set of centres and a set of data-points it
> returns a list of clustered data-points"
>   (loop for centre in centres
>       collecting (loop for data-point in data-points
>                      when (equal centre (find-closest-centre distance-fn
> centres data-point))
>                      collecting data-point)))
> 
> (defun clusters-mean (mean-fn clusters)
>   "Given a mean function and a list of data clusteres it returns the mean of
> each cluster set"
>   (loop for cluster in clusters collecting (funcall mean-fn cluster)))
> 
> (defun determine-best-centres (distance-fn mean-fn centres data-points)
>   "Given a distance function, a mean function, a set of starting centres and
> a set of data-points it returns an optimum set of clustering centres"
>   (let ((new-centres (clusters-mean mean-fn (cluster-data distance-fn
> centres data-points))))
>     (if (equal new-centres centres) new-centres
>       (determine-best-centres distance-fn mean-fn new-centres
> data-points))))
> 
> (defun initialize-data-for-k-means (normalize-fn number-of-clusters
> data-points)
>   "Given a normalizing function, the number of desired clusters and a set of
> data points
>    it returns the normalized data-points, a set of starting cluster centres
> and the normalizing-point"
>   (multiple-value-bind (normalized-data normalizing-point) (funcall
> normalize-fn data-points)
>     (values normalized-data
>             (loop for i to (- number-of-clusters 1) collecting (nth i
> normalized-data))
>             normalizing-point)))
> 
> (defun k-mean (distance-fn mean-fn normalize-fn denormalize-fn
> number-of-clusters data-points)
>   "runs the k-means algorithm on a set of given data-points - after that the
> appropriate functions have been also provided"
>   (multiple-value-bind (normalized-data centres normalizing-factor)
> (initialize-data-for-k-means normalize-fn number-of-clusters data-points)
>     (let ((final-centres (determine-best-centres distance-fn mean-fn centres
> normalized-data)))
>       (loop for centre in final-centres
>           collecting (make-instance 'cluster-set
>                        :centre (first (funcall denormalize-fn
> normalizing-factor (list centre)))
>                        :data (funcall denormalize-fn normalizing-factor
>                                       (loop for normalized-data-point in
> normalized-data
>                                           when (equal centre
> (find-closest-centre distance-fn final-centres normalized-data-point))
>                                           collecting
> normalized-data-point)))))))
> 
> (defmethod intraset-distance (distance-fn (cluster cluster-set))
>   "Returns the average distance of the elements of a cluster from the centre
> of the cluster"
>   (loop for data-point in (data cluster)
>       counting data-point into data-count
>       summing (funcall distance-fn data-point (centre cluster)) into
> distance-sum
>       finally (return (/ distance-sum data-count))))
> 
> (defun interset-distance (distance-fn mean-fn clusters)
>   "Returns the average distance of the centres of a set of clusters from the
> mean (mean-fn) of those centres"
>   (let* ((centres-list (mapcar #'centre clusters))
>          (baricentre (first (clusters-mean mean-fn (list centres-list)))))
>     (loop for centre in centres-list
>         counting centre into centre-count
>         summing (funcall distance-fn centre baricentre) into centre-sum
>         finally (return (/ centre-sum centre-count)))))
> 
> 
> ;;Functions specific to 2D points locations
> 
> (defun 2d-points-euclidian-distance (point-a point-b)
>   "Calculates the euclidian distance between two 2D points"
>   (sqrt (+ (expt (- (first point-a) (first point-b)) 2) (expt (- (second
> point-a) (second point-b)) 2))))
> 
> (defun 2d-points-mean (points)
>   "Given a set of points it returns the mean point"
>   (loop for point in points
>       counting point into point-count
>       summing (first point) into x-sum
>       summing (second point) into y-sum
>       finally (return (list (/ x-sum point-count) (/ y-sum point-count)))))
> 
> (defun 2d-points-normalize (points)
>   "Given a set of points it returns the points normalized and the
> 'normalizing point'"
>   (let ((normalizing-point (list (apply #'max (mapcar #'first points))
> (apply #'max (mapcar #'second points)))))
>     (values (mapcar #'(lambda(point)(list (/ (first point) (first
> normalizing-point)) (/ (second point) (second normalizing-point)))) points)
>             normalizing-point)))
> 
> (defun 2d-points-denormalize (normalizing-point points)
>   "Given a `normalizing point' and a set of normalized points it returns the
> points in their original scale"
>   (loop for point in points collecting (list (* (first point) (first
> normalizing-point)) (* (second point) (second normalizing-point)))))
From: Rmagere
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <c1pk67$meg$1@news.ox.ac.uk>
rif wrote:

> 1.  I find that the normalizing/denormalizing adds to the complexity
>     of the code without really adding to the power.  If you want to
>     normalize and denormalize, just normalize before you pass the data
>     in, and denormalize after you get it back --- in other words, do
>     it outside the context of the kmeans algorithm.  It just seems
>     awkward to pass in a normalizing function, which you then pass to
>     an initialization function, which you then simply apply wholesale
>     to your data and then don't use again.

I agree in fact initially I didn't have it (nor the initializing function)
originally,
however it doesn't really have to be used as best-centres and cluster-data
could be called indipendently - I guess I'll create a k-mean and
k-mean-with-normalization functions.

> 2.  I think there are many times you use loop where it would be
>     simpler to use other CL primitives.  Loop is worth knowing, but
>     your code is awkward because you use it when it's not appropriate.
>     For instance, for each of the following pairs, I would replace
>     your code (shown first) with the alternative :

The following suggestions are great especially as I tend to forget most of
the lisp primitives (i.e. I should read CltL2, haven't done so yet, it's
just
sitting fatly on my desk).

>     (loop for i to (- number-of-clusters 1) collecting (nth i
normalized-data))
>     (subseq normalize-data (- number-of-clusters 1))

I like it
>
>     (loop for cluster in clusters collecting (funcall mean-fn cluster))
>     (map #'mean-fn cluster)

As I like this one, for some reason I have no issue using mapcar but I never
seem to use map.

>     (loop for data-point in (data cluster)
>       counting data-point into data-count
>       summing (funcall distance-fn data-point (centre cluster)) into
distance-sum
>       finally (return (/ distance-sum data-count))))
>
>     (let ((data (data cluster))
>           (centre (centre cluster)))
>       (/ (reduce #'+ (map (lambda (d-p) (funcall distance-fn d-p
>                           centre))) data))
>          (length data))

Ok I don't like this one I just feel it is more cluttered and less clear
what's going on, though I guess
I could replace my loop code with:

(loop for data-point in (data cluster) summing (funcall distance-fn
data-point (centre cluster)) into distance-sum
    finally (return (/ distance-sum (length data))))

Mmh I still like best the my first loop in terms of visual prettiness and
clarity of reading.

>
> 3.  There is a double loop in cluster-data and in k-mean that is
>     basically the same code twice.  You've written the same code twice
>     because although you're returning a cluster-set as the final
>     result, internal to the algorithm you're not using cluster sets.
>     Conceptually, the thing you have at every iteration of kmeans is
>     in the same form as the thing you're going to return (a bunch of
>     centres, and a collection of which points are closest to those
>     centers).  Take advantage of this.

I agree again there, guess I should have gone for storing the clustered data
as ( (centre (list of data) (centre (list of data) ) and call rest and first
appropriately


> 4.  Be aware that the double loop referred to in (3) is not
>     conceptually/algorithmically necessary.  To put n points in d
>     dimensions into k clusters, it is only necessary to do O(ndk)
>     work.  Your approach does O(ndk^2) work.  To get around this, get
>     rid of the outer loop --- for each data point, find the closest
>     center, and add that data point to the cluster for that center.
>     Note that you should only worry about this if you profile your
>     code and see this is the bottleneck.  An intermediate approach
>     (which is what I do in my own kmeans implementation) is to cache a
>     table of all the distances to all the centers (this is O(ndk)
>     work) and then to do a double loop but not recompute the distances
>     (an additional O(nk^2) amount of work, or O(n(d+k)k) work total.
>     Your own mileage may vary, but I suspect since you're using lists
>     and generic arithmetic everywhere, you've got plenty of other
>     optimization issues to worry about before this one.

Well to be honest for my own interest I am not likely to want more than 6,7
clusters from
about 100  5,6 dimensional points. So I don't think that I will really run
into any bottlenecks,
this was more for fun and get to know where optimizations could be made if I
had to.

> 5.  You don't need # before a closure or before keywords.  Examples
> from my code:
>
> (defpackage "KMEANS"
>   (:use :common-lisp :dataset :dataset-utils :utilities :gen-opt
>   :combinatorics) (:export :kmeans :log-wk))
>
> (in-package :kmeans)
>
> (defun-inline find-closest-index (point centers)
>   (optimal-index centers (lambda (c) (l2-distance-squared point c))))

So should I inline my functions?


> Hope this helps.
>
This was great, thanks also do you have any of the code available online or
do you know of
lisp repository for algorithm such as k-mean, EM, more or less for stuff
that is talked about in
"Neural Networks for Pattern Recognition" by Bishop? As I looked at cliki
(and google) but
there didn't seem to be any.

Thanks again.
From: rif
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <wj0hdxb1gn7.fsf@five-percent-nation.mit.edu>
> >     (loop for cluster in clusters collecting (funcall mean-fn cluster))
> >     (map #'mean-fn cluster)
> 
> As I like this one, for some reason I have no issue using mapcar but I never
> seem to use map.

Sorry, what I wrote is wrong.  You'd need mapcar there, not map.

> >     (loop for data-point in (data cluster)
> >       counting data-point into data-count
> >       summing (funcall distance-fn data-point (centre cluster)) into
> distance-sum
> >       finally (return (/ distance-sum data-count))))
> >
> >     (let ((data (data cluster))
> >           (centre (centre cluster)))
> >       (/ (reduce #'+ (map (lambda (d-p) (funcall distance-fn d-p
> >                           centre))) data))
> >          (length data))
> 
> Ok I don't like this one I just feel it is more cluttered and less clear
> what's going on, though I guess
> I could replace my loop code with:
> 
> (loop for data-point in (data cluster) summing (funcall distance-fn
> data-point (centre cluster)) into distance-sum
>     finally (return (/ distance-sum (length data))))
> 
> Mmh I still like best the my first loop in terms of visual prettiness and
> clarity of reading.
> 

It's a matter of taste.  I find your second loop construct much
lispier than the first.  In particular, explicitly having a clause in
a for loop that counts the length of a loop seems awkward to me.  On
the other hand, pulling out the length forces the code to traverse the
list twice (barring compiler optimizations that I doubt would happen).
To each their own.

> > 3.  There is a double loop in cluster-data and in k-mean that is
> >     basically the same code twice.  You've written the same code twice
> >     because although you're returning a cluster-set as the final
> >     result, internal to the algorithm you're not using cluster sets.
> >     Conceptually, the thing you have at every iteration of kmeans is
> >     in the same form as the thing you're going to return (a bunch of
> >     centres, and a collection of which points are closest to those
> >     centers).  Take advantage of this.
> 
> I agree again there, guess I should have gone for storing the clustered data
> as ( (centre (list of data) (centre (list of data) ) and call rest and first
> appropriately

Alternately, just use the cluster-set structure internally as well.

> > 5.  You don't need # before a closure or before keywords.  Examples
> > from my code:
> >
> > (defpackage "KMEANS"
> >   (:use :common-lisp :dataset :dataset-utils :utilities :gen-opt
> >   :combinatorics) (:export :kmeans :log-wk))
> >
> > (in-package :kmeans)
> >
> > (defun-inline find-closest-index (point centers)
> >   (optimal-index centers (lambda (c) (l2-distance-squared point c))))
> 
> So should I inline my functions?

Sorry, defun-inline is my own macro.  Ignore the defun-inline --- the
point of that last bit of code was that you don't need to write #'
before a closure.  In other words, you don't need to write:

(defun-inline find-closest-index (point centers)
   (optimal-index centers #'(lambda (c) (l2-distance-squared point c))))
 
> This was great, thanks also do you have any of the code available online or
> do you know of
> lisp repository for algorithm such as k-mean, EM, more or less for stuff
> that is talked about in
> "Neural Networks for Pattern Recognition" by Bishop? As I looked at cliki
> (and google) but
> there didn't seem to be any.

I don't know of such a thing, sorry.  But if you wanted to write one,
I'm sure it would be appreciated!

Cheers,

rif
From: rmagere
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <c1qap7$168$1@news.ox.ac.uk>
rif wrote:
>>>     (loop for cluster in clusters collecting (funcall mean-fn
>>>     cluster)) (map #'mean-fn cluster)
>>
>> As I like this one, for some reason I have no issue using mapcar but
>> I never seem to use map.
>
> Sorry, what I wrote is wrong.  You'd need mapcar there, not map.
>

Yeah I actually do not understand why I used loop rather than mapcar,
mmh it must have been that I had been writing a few loops before and
my mind was kind of stuck on it

>>> 3.  There is a double loop in cluster-data and in k-mean that is
>>>     basically the same code twice.  You've written the same code
>>>     twice because although you're returning a cluster-set as the
>>>     final result, internal to the algorithm you're not using
>>>     cluster sets. Conceptually, the thing you have at every
>>>     iteration of kmeans is in the same form as the thing you're
>>>     going to return (a bunch of centres, and a collection of which
>>>     points are closest to those centers).  Take advantage of this.
>>
>> I agree again there, guess I should have gone for storing the
>> clustered data as ( (centre (list of data) (centre (list of data) )
>> and call rest and first appropriately
>
> Alternately, just use the cluster-set structure internally as well.
>

I'll have a go at modifing this bit.

> Sorry, defun-inline is my own macro.  Ignore the defun-inline --- the
> point of that last bit of code was that you don't need to write #'
> before a closure.  In other words, you don't need to write:
>
> (defun-inline find-closest-index (point centers)
>    (optimal-index centers #'(lambda (c) (l2-distance-squared point
> c))))
>

Oh I was aware of that I just like using #' as a matter of consistency
just like I use setf everywhere and never use setq.

>> do you know of
>> lisp repository for algorithm such as k-mean, EM, more or less for
>> stuff that is talked about in
>> "Neural Networks for Pattern Recognition" by Bishop? As I looked at
>> cliki (and google) but
>> there didn't seem to be any.
>
> I don't know of such a thing, sorry.  But if you wanted to write one,
> I'm sure it would be appreciated!
>

Well I guess that once I have cleaned up this bit of code I can put it on
cliki
and add bits to it as time goes by.
[I'll probably be adding a reference to your comments on the bits that
I will not be changing if you don't mind]
From: Gareth McCaughan
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <8765docbke.fsf@g.mccaughan.ntlworld.com>
"Rmagere" wrote:

> >     (loop for data-point in (data cluster)
> >       counting data-point into data-count
> >       summing (funcall distance-fn data-point (centre cluster)) into distance-sum
> >       finally (return (/ distance-sum data-count))))

"rif" suggested replacing it with:

> >     (let ((data (data cluster))
> >           (centre (centre cluster)))
> >       (/ (reduce #'+ (map (lambda (d-p) (funcall distance-fn d-p
> >                           centre))) data))
> >          (length data))

and "Rmagere" then suggested instead:

> (loop for data-point in (data cluster)
>       summing (funcall distance-fn data-point (centre cluster))
>         into distance-sum
>       finally (return (/ distance-sum (length data))))

I prefer ...

    ;; Utilities:
    
    (defmacro map-expr ((pronoun list) expr)
      `(loop for ,pronoun in ,list collect ,expr))
    
    (defmacro sum-expr ((pronoun list) expr)
      `(loop for ,pronoun in ,list sum ,expr))
    
    (defmacro average-expr ((pronoun list) expr)
      `(loop for ,pronoun in ,list sum ,expr into total
             count t into number
             finally (return (/ total count))))
    
    ;; Code:
    
    (average-expr (data-point (data cluster))
      (funcall distance-fn data-point (centre cluster)))

Although it feels wrong to be computing (CENTRE CLUSTER) over and over
again, so maybe instead

    (let ((centre (centre cluster)))
      (average-expr (data-point (data cluster))
        (funcall distance-fn data-point centre)))

Or, if you don't mind a little consing,

    ;; Utilities:
    
    (defun sum (items) (reduce #'+ items))
    (defun average (items) (/ (sum items) (length items)))
    
    ;; Code:
    
    (average (mapcar
               (lambda (point)
                 (funcall distance-fn point (centre cluster)))
               (data cluster)))

If you have an extensible LOOP implementation, you could
implement an AVERAGING / AVERAGE keyword that would let you
write

    (loop for point in (data cluster)
          average (funcall distance-fn point (centre cluster)))

which would be something like ideal.

-- 
Gareth McCaughan
.sig under construc
From: Rahul Jain
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <87ad2zaf0j.fsf@nyct.net>
Gareth McCaughan <················@pobox.com> writes:

>     ;; Utilities:
>     
>     (defmacro map-expr ((pronoun list) expr)
>       `(loop for ,pronoun in ,list collect ,expr))
>     
>     (defmacro sum-expr ((pronoun list) expr)
>       `(loop for ,pronoun in ,list sum ,expr))
>     
>     (defmacro average-expr ((pronoun list) expr)
>       `(loop for ,pronoun in ,list sum ,expr into total
>              count t into number
>              finally (return (/ total count))))

Pah!

(let ((item (scan 'list items))) ; or scan any other data structure
                                 ; or using any function(s) you like
  (/ (collect-sum item) (collect-length item)))

This code traverses the list once.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: rmagere
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <c24cdi$n67$1@news.ox.ac.uk>
Rahul Jain wrote:
>
> Pah!
>
> (let ((item (scan 'list items))) ; or scan any other data structure
>                                  ; or using any function(s) you like
>   (/ (collect-sum item) (collect-length item)))
>
> This code traverses the list once.

Rahul Jain wrote:
>
> Pah!
>
> (let ((item (scan 'list items))) ; or scan any other data structure
>                                  ; or using any function(s) you like
>   (/ (collect-sum item) (collect-length item)))
>
> This code traverses the list once.

I know that scan is in Appendix A of CltL2, however I don't think it's
present on common lisp implementations without installing some other package
(like series or whatever), so it seems a bit of an overkill in order do just
a void a loop.
From: Rahul Jain
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <87ishic8qq.fsf@nyct.net>
"rmagere" <·······@*the-mail-that-burns*.com> writes:

> I know that scan is in Appendix A of CltL2, however I don't think it's
> present on common lisp implementations without installing some other package
> (like series or whatever), so it seems a bit of an overkill in order do just
> a void a loop.

Ah, so there's the problem. You shuold be using it throughout the code. 
That way you can amortize its cost. ;)

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Gareth McCaughan
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <87llmd31om.fsf@g.mccaughan.ntlworld.com>
Rahul Jain wrote:

> Pah!
> 
> (let ((item (scan 'list items))) ; or scan any other data structure
>                                  ; or using any function(s) you like
>   (/ (collect-sum item) (collect-length item)))
> 
> This code traverses the list once.

Yes, SERIES is beautiful.

-- 
Gareth McCaughan
.sig under construc
From: Jan Rychter
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <m2oeq8kakn.fsf@tnuctip.rychter.com>
Rahul Jain wrote:
> Pah!
> 
> (let ((item (scan 'list items))) ; or scan any other data structure
>                                  ; or using any function(s) you like
>   (/ (collect-sum item) (collect-length item)))
> 
> This code traverses the list once.

Hmm. Could you please post a complete example of a function containing
the above code?

It doesn't look like correct SERIES code to me and I am unable to make
it traverse the list once.

--J.
From: Jan Rychter
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <m2k70wk6c1.fsf@tnuctip.rychter.com>
I wrote:
> Rahul Jain wrote:
> > Pah!
> > 
> > (let ((item (scan 'list items))) ; or scan any other data structure
> >                                  ; or using any function(s) you like
> >   (/ (collect-sum item) (collect-length item)))
> > 
> > This code traverses the list once.
> 
> Hmm. Could you please post a complete example of a function containing
> the above code?
> 
> It doesn't look like correct SERIES code to me and I am unable to make
> it traverse the list once.

Please ignore this. The warnings I've been getting and the resulting
unefficient expansion was a result of my CMUCL not executing
(series::install) properly after loading the series system. I still
don't know why it doesn't do that, but after doing a (series::install)
everything works fine and the above does indeed expand into a single
list traversal.

--J.
From: Rahul Jain
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <873c72760d.fsf@nyct.net>
Jan Rychter <···@rychter.com> writes:

> Please ignore this. The warnings I've been getting and the resulting
> unefficient expansion was a result of my CMUCL not executing
> (series::install) properly after loading the series system. I still
> don't know why it doesn't do that, but after doing a (series::install)
> everything works fine and the above does indeed expand into a single
> list traversal.

CMUCL should not run (series::install) after loading the series system. 
(series::install) imports and shadows the appropriate symbols needed for
full series optimization into *package*. This is completely orthogonal
to compiling and loading the implementation of the implementation of the
operators involved.

-- 
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
From: Kalle Olavi Niemitalo
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <87eksdlqd1.fsf@Astalo.kon.iki.fi>
"Rmagere" <·······@*the*mail*that*burns*.com> writes:

>   (loop for data-point in (data cluster)
>       counting data-point into data-count
>       summing (funcall distance-fn data-point (centre cluster)) into
> distance-sum
>       finally (return (/ distance-sum data-count))))

Note that COUNTING counts only true values; if DATA-POINT happens
to be NIL, DATA-COUNT will not increase.  It seems to me that
your functions don't assume any particular representation for the
points, so this behavior may be a bug.
From: rmagere
Subject: Re: K-Means Algorithm
Date: 
Message-ID: <c24c64$mvq$1@news.ox.ac.uk>
Kalle Olavi Niemitalo wrote:
> "Rmagere" <·······@*the*mail*that*burns*.com> writes:
>
>>   (loop for data-point in (data cluster)
>>       counting data-point into data-count
>>       summing (funcall distance-fn data-point (centre cluster)) into
>> distance-sum
>>       finally (return (/ distance-sum data-count))))
>
> Note that COUNTING counts only true values; if DATA-POINT happens
> to be NIL, DATA-COUNT will not increase.  It seems to me that
> your functions don't assume any particular representation for the
> points, so this behavior may be a bug.

You are perfecly right so I guess it will have to be (length data cluster)