CSCE 470 Lecture 16

From Notes
Jump to navigation Jump to search

« previous | Monday, September 30, 2013 | next »


Applications of IR Concepts

  • Clustering
  • Classification
  • Recommenders
  • Interfaces and Visualization


Clustering

The process of grouping a set of documents into clusters of similar documents

Documents within a cluster should be similar to each other

Documents from different clusters should be dissimilar

Most Common form of unsupervised learning [1]

A very common and important task

Why care?

  • Analyze/navigate through Corpus (search without typing)
  • Improving recall: documents within same cluster usually have similar relevance to information need
  • Better navigation of search results (like "clouds" on Yippy)
  • speed up vector space retrieval

Issues

  • Document and Similarity Representation: TF-IDF and cosine

Constraint: Number of clusters

  • Flat vs. Hierarchical: Should we create small clusters, then clusters of clusters, clusters of clusters of clusters, and so on?
  • Soft vs. Hard: should documents be allowed to belong to more than one cluster or only one?

K-Means Flat Clustering Algorithm

Input: corpus of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} documents, desired number of clusters

Output: set of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} clusters containing documents

  1. Pick Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} random "centers"
  2. Assign remaining Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n-k} points to closest "center"
  3. Compute new "center"
  4. Go to 1 unless converged

Analysis

Since algorithm is a randomized optimization algorithm, we are not guaranteed an optimal solution

We could run K-means multiple times...

Ultimate goal is to minimize distance to points within cluster, maximize distance to points in another cluster

Footnotes

  1. unsupervised learning: learning from raw data as opposed to superfised data where classification of examples is given