CSCE 470 Lecture 16
« previous | Monday, September 30, 2013 | next »
Applications of IR Concepts
- Clustering
- Classification
- Recommenders
- Interfaces and Visualization
Clustering
The process of grouping a set of documents into clusters of similar documents
Documents within a cluster should be similar to each other
Documents from different clusters should be dissimilar
Most common form of unsupervised learning [1]
A very common and important task
Why care?
- Analyze/navigate through a corpus (search without typing)
- Improving recall: documents within same cluster usually have similar relevance to information need
- Better navigation of search results (like "clouds" on Yippy)
- Speed up vector space retrieval
Issues
- Document and similarity representation: TF-IDF vectors and cosine similarity
- Number of clusters: often a fixed constraint given as input
- Flat vs. hierarchical: should we create small clusters, then clusters of clusters, clusters of clusters of clusters, and so on?
- Soft vs. hard: should documents be allowed to belong to more than one cluster, or only one?
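As a sketch of the similarity representation above: cosine similarity between two TF-IDF vectors, here represented as term-to-weight dictionaries (the weights below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    # Dot product over terms the two vectors share
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy TF-IDF vectors (hypothetical weights)
d1 = {"cluster": 0.8, "document": 0.5}
d2 = {"cluster": 0.6, "search": 0.7}
sim = cosine_similarity(d1, d2)  # higher when documents share heavily weighted terms
```

Cosine similarity ignores document length, which is why it is the usual choice for comparing documents of different sizes.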
K-Means Flat Clustering Algorithm
Input: corpus of documents (as vectors), desired number of clusters K
Output: set of K clusters containing the documents
1. Pick K random "centers"
2. Assign each remaining point to its closest "center"
3. Compute new "centers" (the mean of each cluster's points)
4. Go to step 2 unless converged
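The steps above can be sketched in Python on toy 2-D points (the function names and data are illustrative, not from the lecture):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Componentwise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal flat K-means sketch: pick centers, assign, recompute, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                    # 1. pick K random centers
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # 2. assign to closest center
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        new_centers = [mean(cl) if cl else centers[i]  # 3. recompute each center
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                     # 4. stop once assignments are stable
            break
        centers = new_centers
    return centers, clusters

# Toy usage: two well-separated groups of 2-D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

Note the guard for empty clusters: if no point is assigned to a center, this sketch simply keeps the old center rather than dropping the cluster.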
Analysis
Since K-means is a randomized optimization algorithm (the result depends on the initial centers), we are not guaranteed an optimal solution
We could run K-means multiple times with different random initializations and keep the best clustering
Ultimate goal is to minimize the distance between points within a cluster and maximize the distance between points in different clusters
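The within-cluster part of that goal is commonly measured as the residual sum of squares (RSS): the total squared distance from each point to its cluster's center. When running K-means multiple times, one keeps the run with the lowest RSS. A sketch (the function and toy data are illustrative):

```python
def rss(centers, clusters):
    """Residual sum of squares: total squared distance of points to their centers."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))  # squared distance of p to its center c
        for c, cluster in zip(centers, clusters)
        for p in cluster
    )

# Toy example: two 1-D clusters whose centers are their means
centers = [(1.0,), (10.0,)]
clusters = [[(0.0,), (2.0,)], [(9.0,), (11.0,)]]
total = rss(centers, clusters)  # each point is distance 1 from its center: RSS = 4.0
```

A lower RSS means tighter clusters, so comparing RSS across random restarts is a simple way to pick the best run.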
Footnotes
- ↑ unsupervised learning: learning from raw data, as opposed to supervised learning, where the classification of the examples is given