CSCE 470 Lecture 17

From Notes
Jump to navigation Jump to search

« previous | Wednesday, October 2, 2013 | next »


Flat Clustering

Measuring "cluster goodness" for a k-means output:

Residual Sum of Squares

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_{c \in C} \sum_{x \in c} (\vec{x} - \mu_c)^2}
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \vec{x}} is vector to position
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mu_i} is the centroid (component-wise averages)

We've discussed meta-looping to dispense with random chance of local optimum, but what about choosing the number of clusters?

We could add penalty to "gooness" that incorporates the number of clusters (i.e. more clusters = bad)

Purity

measure "good things" as the most prominent "type" in each cluster.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{purity} = \frac{1}{n} \sum_k \max_j |\omega \cap c_k|}

Hierarchical Clustering

Lots of little clusters merge to form overall cluster.

Bisecting K-means

Input: document corpus (no Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} ; implicitly 2)

"Recursively" call k-means on elements of each cluster to further break them up.

Stop case is when each leaf cluster is a document


Hierarchical Agglomerative (HAC)

Whoa! New favorite word: Agglomerative