Canopy clustering algorithm
It is an
unsupervised pre-clustering algorithm performed before a K-means clustering or
Hierarchical clustering.
It is basically
performed to speed up the clustering in the case of large data sets, in which a
direct implementation of the main algorithm may be impractical due to the size
of the data set.
Algorithm
Start with a set/list
of data points and two distance thresholds T1 > T2 for processing.
- Select any point (at random) from this list to form a canopy center.
- Approximate its distance to all other points in the list.
- Put all the points which fall within the distance threshold of T1 into a canopy.
- Remove from the (main/original) list all the points which fall within the threshold of T2. These points are excluded from being the center of and forming new canopies.
- Repeat from step 1 to 4 until the original list is empty.
For an exhaustive
study please go through a paper by McCallum, Nigam and Ungar, located at http://www.kamalnigam.com/papers/canopy-kdd00.pdf.
References:
- Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
- https://cwiki.apache.org/MAHOUT/canopy-clustering.html
- http://en.wikipedia.org/wiki/Canopy_clustering_algorithm