CSCE 470 Lecture 16
« previous | Monday, September 30, 2013 | next »
Applications of IR Concepts
- Clustering
- Classification
- Recommenders
- Interfaces and Visualization
Clustering
The process of grouping a set of documents into clusters of similar documents
Documents within a cluster should be similar to each other
Documents from different clusters should be dissimilar
Most Common form of unsupervised learning [1]
A very common and important task
Why care?
- Analyze/navigate through Corpus (search without typing)
- Improving recall: documents within same cluster usually have similar relevance to information need
- Better navigation of search results (like "clouds" on Yippy)
- speed up vector space retrieval
Issues
- Document and Similarity Representation: TF-IDF and cosine
Constraint: Number of clusters
- Flat vs. Hierarchical: Should we create small clusters, then clusters of clusters, clusters of clusters of clusters, and so on?
- Soft vs. Hard: should documents be allowed to belong to more than one cluster or only one?
K-Means Flat Clustering Algorithm
Input: corpus of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} documents, desired number of clusters
Output: set of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} clusters containing documents
- Pick Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} random "centers"
- Assign remaining Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n-k} points to closest "center"
- Compute new "center"
- Go to 1 unless converged
Analysis
Since algorithm is a randomized optimization algorithm, we are not guaranteed an optimal solution
We could run K-means multiple times...
Ultimate goal is to minimize distance to points within cluster, maximize distance to points in another cluster
Footnotes
- ↑ unsupervised learning: learning from raw data as opposed to superfised data where classification of examples is given