CSCE 470 Lecture 16

Monday, September 30, 2013

Applications of IR Concepts

Clustering: the process of grouping a set of documents into clusters of similar documents

Documents within a cluster should be similar to each other

Documents from different clusters should be dissimilar
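To make "similar" and "dissimilar" concrete, here is a minimal sketch that compares documents by cosine similarity over raw term counts. The notes do not say which similarity measure is meant, so cosine similarity, the whitespace tokenizer, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw term counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms that occur in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: d1 and d2 are about cars, d3 is about animals.
docs = {
    "d1": "jaguar car engine speed",
    "d2": "car engine repair",
    "d3": "jaguar cat jungle habitat",
}
same_topic = cosine_similarity(docs["d1"], docs["d2"])
other_topic = cosine_similarity(docs["d2"], docs["d3"])
print(same_topic, other_topic)
```

A good clustering would place d1 and d2 together and d3 elsewhere, since the first pair scores higher than the second.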

Most common form of unsupervised learning ^[1]

A very common and important task

Analyze/navigate through a corpus (search without typing)
Improve recall: documents within the same cluster usually have similar relevance to an information need
Better navigation of search results (like "clouds" on Yippy)
Speed up vector space retrieval

Constraint: Number of clusters

Flat vs. Hierarchical: Should we create small clusters, then clusters of clusters, clusters of clusters of clusters, and so on?
Soft vs. Hard: Should documents be allowed to belong to more than one cluster or only one?
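The soft vs. hard distinction can be sketched in terms of how one document's distances to the cluster centers are turned into a membership. The function names and the exp(-distance) weighting below are hypothetical choices for illustration, not something the notes prescribe.

```python
import math


def hard_assign(distances):
    """Hard clustering: the document belongs to exactly one cluster (the nearest)."""
    return min(range(len(distances)), key=lambda j: distances[j])


def soft_assign(distances):
    """Soft clustering: a membership weight for every cluster, summing to 1.

    The exp(-distance) weighting is an assumed, illustrative choice.
    """
    weights = [math.exp(-d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distances from one document to three cluster centers.
distances = [0.2, 0.3, 1.5]
print(hard_assign(distances))
print(soft_assign(distances))
```

Under hard assignment the near-tie between the first two clusters is lost; under soft assignment the document keeps substantial weight in both.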

Input: corpus of $n$ documents, desired number of clusters $k$

Output: set of $k$ clusters containing documents

Since K-means is a randomized optimization algorithm, we are not guaranteed an optimal solution

We could run K-means multiple times...
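One way to act on this idea is to run K-means from several random initializations and keep the run with the lowest within-cluster squared distance. The sketch below is a plain-Python illustration under assumptions: documents are already represented as numeric tuples, squared Euclidean distance is used, and the names `kmeans` and `best_of_restarts` are hypothetical helpers, not part of the lecture.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iterations=100):
    """One K-means run from a random initialization.

    Returns (centroids, labels, rss), where rss is the total squared
    distance from each point to its assigned centroid.
    """
    centroids = rng.sample(points, k)  # pick k input points as random seeds
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, rss


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means several times and keep the run with the smallest rss."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, rss = best_of_restarts(points, k=2)
print(labels, rss)
```

Restarts do not guarantee the optimum either; they only make a poor local solution less likely to be the one that is kept.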

The ultimate goal is to minimize the distance between points within the same cluster and maximize the distance to points in other clusters
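The notes do not give a formula for this goal. As an assumption, the quantity that standard K-means minimizes is the residual sum of squares, where $\omega_j$ denotes the $j$-th cluster and $\mu_j$ its centroid (both symbols are introduced here for illustration):

$$\mathrm{RSS} = \sum_{j=1}^{k} \sum_{x \in \omega_j} \lVert x - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|\omega_j|} \sum_{x \in \omega_j} x$$

This measures only within-cluster distance directly; separation between clusters is encouraged indirectly, because every point is assigned to its nearest centroid.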

^[1] Unsupervised learning: learning from raw data, as opposed to supervised learning, where the classification of each example is given