CSCE 470 Lecture 17

Wednesday, October 2, 2013

Flat Clustering

Measuring "cluster goodness" for a k-means output:

$$\sum_{c \in C} \sum_{\vec{x} \in c} \lVert \vec{x} - \vec{\mu}_c \rVert^2$$
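As a rough illustration of the sum above, here is a minimal Python sketch. It assumes each cluster is a non-empty list of equal-length numeric vectors and that the centroid of a cluster is the component-wise mean of its points; the function name `rss` and the toy data are made up for this example.

```python
def rss(clusters):
    """Sum of squared distances from each point to its cluster centroid.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    equal-length numeric vectors (tuples or lists).
    """
    total = 0.0
    for cluster in clusters:
        dim = len(cluster[0])
        # Centroid: the component-wise mean of the cluster's points.
        centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(dim)]
        for p in cluster:
            total += sum((p[i] - centroid[i]) ** 2 for i in range(dim))
    return total

# Toy data: two clusters in the plane.
example = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
print(rss(example))  # 2.0
```

Lower values mean tighter clusters; a cluster holding a single point contributes 0.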

We've discussed meta-looping to reduce the chance of ending up in a poor local optimum, but what about choosing the number of clusters?

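We could add a penalty to "goodness" that incorporates the number of clusters (i.e. more clusters = bad). Without one, the best achievable sum of squared distances never gets worse as clusters are added, and it reaches 0 when every document is its own cluster.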
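One simple way to realize such a penalty is to add a term proportional to the number of clusters and pick the k with the lowest combined score. The sketch below assumes this linear form; the function name `choose_k`, the `penalty` weight, and the RSS values are all hypothetical and chosen only to show the mechanics.

```python
def choose_k(rss_by_k, penalty):
    """Return the k that minimizes RSS(k) + penalty * k."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + penalty * k)

# Made-up RSS values for illustration only (not measurements from any corpus).
rss_by_k = {1: 100.0, 2: 40.0, 3: 15.0, 4: 12.0, 5: 11.0}
print(choose_k(rss_by_k, penalty=5.0))  # 3
print(choose_k(rss_by_k, penalty=0.0))  # 5
```

With no penalty the largest k wins, which is exactly the problem the penalty is meant to address. How large the weight should be is a separate judgment call that this sketch does not settle.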

Alternatively, when the true "type" of each document is known, measure "good things" as how many documents belong to the most prominent "type" in each cluster.

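$$\mathrm{purity} = \frac{1}{n} \sum_{k} \max_{j} |\omega_j \cap c_k|$$

where $c_k$ is the set of documents in cluster $k$, $\omega_j$ is the set of documents of type $j$, and $n$ is the total number of documents.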
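A small Python sketch of the purity computation, assuming each cluster is represented simply as the list of true type labels of its members (the representation and the toy labels are assumptions made for this example):

```python
from collections import Counter

def purity(clusters):
    """Fraction of documents that carry the majority label of their cluster.

    `clusters` is a list of clusters; each cluster is a list of the true
    type labels of the documents assigned to it.
    """
    n = sum(len(c) for c in clusters)
    # For each non-empty cluster, count the documents of its most common type.
    majority_total = sum(max(Counter(c).values()) for c in clusters if c)
    return majority_total / n

toy = [["a", "a", "b"], ["b", "b", "b", "a"]]
print(purity(toy))  # (2 + 3) / 7
```

Note that purity is 1 whenever every document sits in its own cluster, so it cannot be used on its own to pick the number of clusters.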

Hierarchical clustering, bottom-up: lots of little clusters merge, step by step, to form one overall cluster.

Top-down version. Input: document corpus (no $k$; implicitly 2 at each split)

"Recursively" call k-means on elements of each cluster to further break them up.

The stopping case is when each leaf cluster is a single document.
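The recursive splitting described above can be sketched as follows. This is only an illustration under stated assumptions: documents are numeric tuples, the 2-way split is a bare-bones k-means with a deterministic seeding (the first point and the point farthest from it) chosen for reproducibility, and the helper names `two_means` and `divisive` are made up here.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def two_means(points, iterations=20):
    """Split points into two groups with a basic k-means (k = 2).

    Seeding is an assumption of this sketch: the first point and the
    point farthest from it are used as the initial centroids.
    """
    seeds = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            # Assign each point to the nearer centroid (ties go to the first).
            nearer = 0 if sq_dist(p, seeds[0]) <= sq_dist(p, seeds[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break
        seeds = [mean(groups[0]), mean(groups[1])]
    return groups

def divisive(points):
    """Return a nested-list tree whose leaves are single documents."""
    if len(points) == 1:
        return points[0]
    left, right = two_means(points)
    if not left or not right:
        # No split was possible (e.g. all points identical): keep them together.
        return list(points)
    return [divisive(left), divisive(right)]

docs = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(divisive(docs))  # [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
```

Internal nodes are two-element lists and leaves are the document tuples themselves, so the nesting depth records the order in which the corpus was broken up.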

Whoa! New favorite word: Agglomerative