ashish karki: Canopy clustering algorithm

Friday, January 27, 2012

Canopy clustering algorithm

Canopy clustering algorithm

It is an unsupervised pre-clustering algorithm performed before a K-means clustering or Hierarchical clustering.

It is basically performed to speed up the clustering in the case of large data sets, in which a direct implementation of the main algorithm may be impractical due to the size of the data set.

Algorithm

Start with a set/list of data points and two distance thresholds T1 > T2 for processing.

Select any point (at random) from this list to form a canopy center.
Approximate its distance to all other points in the list.
Put all the points which fall within the distance threshold of T1 into a canopy.
Remove from the (main/original) list all the points which fall within the threshold of T2. These points are excluded from being the center of and forming new canopies.
Repeat from step 1 to 4 until the original list is empty.

For an exhaustive study please go through a paper by McCallum, Nigam and Ungar, located at http://www.kamalnigam.com/papers/canopy-kdd00.pdf.

References:

Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
https://cwiki.apache.org/MAHOUT/canopy-clustering.html
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm

1 comment:

RahulFebruary 26, 2013 at 1:45 AM
What do we actually mean by "Step 5 original list is empty" ?

--
BR,
Rahul
ReplyDelete
Replies

Add comment