CSCE 470 Lecture 16

From Notes
Jump to navigation Jump to search

« previous | Monday, September 30, 2013 | next »


Applications of IR Concepts

  • Clustering
  • Classification
  • Recommenders
  • Interfaces and Visualization


Clustering

The process of grouping a set of documents into clusters of similar documents

Documents within a cluster should be similar to each other

Documents from different clusters should be dissimilar

Most Common form of unsupervised learning [1]

A very common and important task

Why care?

  • Analyze/navigate through Corpus (search without typing)
  • Improving recall: documents within same cluster usually have similar relevance to information need
  • Better navigation of search results (like "clouds" on Yippy)
  • speed up vector space retrieval

Issues

  • Document and Similarity Representation: TF-IDF and cosine

Constraint: Number of clusters

  • Flat vs. Hierarchical: Should we create small clusters, then clusters of clusters, clusters of clusters of clusters, and so on?
  • Soft vs. Hard: should documents be allowed to belong to more than one cluster or only one?

K-Means Flat Clustering Algorithm

Input: corpus of documents, desired number of clusters

Output: set of clusters containing documents

  1. Pick random "centers"
  2. Assign remaining points to closest "center"
  3. Compute new "center"
  4. Go to 1 unless converged

Analysis

Since algorithm is a randomized optimization algorithm, we are not guaranteed an optimal solution

We could run K-means multiple times...

Ultimate goal is to minimize distance to points within cluster, maximize distance to points in another cluster

Footnotes

  1. unsupervised learning: learning from raw data as opposed to superfised data where classification of examples is given