Friday, January 27, 2012

Canopy clustering algorithm

Canopy clustering algorithm

It is an unsupervised pre-clustering algorithm performed before a K-means clustering or Hierarchical clustering.
It is basically performed to speed up the clustering in the case of large data sets, in which a direct implementation of the main algorithm may be impractical due to the size of the data set.

Algorithm

Start with a set/list of data points and two distance thresholds T1 > T2 for processing.

  1. Select any point (at random) from this list to form a canopy center.
  2. Approximate its distance to all other points in the list.
  3. Put all the points which fall within the distance threshold of T1 into a canopy.
  4. Remove from the (main/original) list all the points which fall within the threshold of T2. These points are excluded from being the center of and forming new canopies.
  5. Repeat from step 1 to 4 until the original list is empty.
For an exhaustive study please go through a paper by McCallum, Nigam and Ungar, located at http://www.kamalnigam.com/papers/canopy-kdd00.pdf

References:

  1. Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
  2. https://cwiki.apache.org/MAHOUT/canopy-clustering.html
  3. http://en.wikipedia.org/wiki/Canopy_clustering_algorithm

1 comment:

  1. What do we actually mean by "Step 5 original list is empty" ?

    --
    BR,
    Rahul

    ReplyDelete