CSCE 470 Lecture 19

Monday, October 7, 2013


Overview

Last Week: Clustering

  • unsupervised learning
  • input: docs, maybe k
  • output: clusters of documents
  • methods
    • k-means
    • HAC

This Week: Classification

  • supervised learning
  • input: docs, classes
  • output: mapping of documents to classes
  • methods:
    • Rocchio
    • kNN
    • Naïve Bayes
    • SVM

Classification

Document Space

  • emails
  • web pages
  • tweets
  • Facebook profiles (of my "friends")
  • etc.

Classes

  • spam, not spam
  • mentions, no mentions
  • college friends, high school friends, olds, everyone else

For now, we will assume classes are mutually exclusive and collectively exhaustive (i.e. they partition the document space).

Goal: a classifier that can map documents to classes. In not so many words: documents → classes.

Approaches

Learning the classifier

  1. Eyeballs (manually)
  2. Figure out some "rules" (college friend if your.college = my.college and |your.grad_year − my.grad_year| ≤ 4)
  3. Machine Learning

Neither 1 nor 2 is very useful (i.e. they break and/or don't scale well), so we will focus on 3.
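
To make approach 2 concrete, here is a minimal sketch of the "college friend" rule as code. The Profile class and its field names are hypothetical, chosen only to mirror the rule above; the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    # Hypothetical fields, named after the rule in the notes.
    college: str
    grad_year: int


def is_college_friend(me: Profile, you: Profile, max_gap: int = 4) -> bool:
    """Hand-written rule: same college and graduation years within max_gap."""
    return you.college == me.college and abs(you.grad_year - me.grad_year) <= max_gap


me = Profile(college="TAMU", grad_year=2014)
print(is_college_friend(me, Profile("TAMU", 2011)))  # same college, 3 years apart
print(is_college_friend(me, Profile("TAMU", 2003)))  # same college, 11 years apart
print(is_college_friend(me, Profile("Rice", 2014)))  # different college
```

Every new class or edge case needs another hand-written rule like this one, which is one way to see why the approach does not scale.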

Machine Learning

  1. Learn the classifier: analyze training data (examples of docs → classes) [1]
  2. Testing/Application: use the classifier on new documents.
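
The two steps can be sketched roughly as below. The learner here is only a placeholder (an assumption, not a method from the lecture): it always predicts the most common training class. The point is the shape of the workflow, not the quality of the classifier. The function names and toy documents are made up for illustration.

```python
from collections import Counter


def learn(training_data):
    """Step 1: analyze (doc, class) examples and return a classifier function.

    Placeholder learner (an assumption): it ignores the document and always
    predicts the most common class seen in training.
    """
    counts = Counter(label for _, label in training_data)
    most_common_class = counts.most_common(1)[0][0]
    return lambda doc: most_common_class


def accuracy(classifier, labeled_docs):
    """Step 2 (testing): fraction of held-out examples classified correctly."""
    correct = sum(1 for doc, label in labeled_docs if classifier(doc) == label)
    return correct / len(labeled_docs)


training = [
    ("cheap pills", "spam"),
    ("win money now", "spam"),
    ("lunch at noon?", "not spam"),
]
held_out = [("free money", "spam"), ("meeting notes", "not spam")]

classifier = learn(training)
print(classifier("brand new email"))   # application: classify an unseen doc
print(accuracy(classifier, held_out))  # testing: score on held-out examples
```

Keeping the held-out examples separate from the training data is what makes the test score meaningful for "new stuff".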

Rocchio

Vector space classification.

Two assumptions:

  1. Docs from the same class "bunch" in a contiguous region of space
  2. Classes are non-overlapping (i.e. drawing a convex blob around points does not include any points from other classes)

Compute centroids for each class in training data.

Assign new documents to closest class centroid.

(For now, this assignment step is completely separate from the training step.)
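
A minimal sketch of the two Rocchio steps, assuming documents are already represented as equal-length numeric vectors. The notes only say "closest", so Euclidean distance is an assumption here (cosine similarity is another common choice). The function names and toy vectors are made up for illustration.

```python
import math
from collections import defaultdict


def train_rocchio(vectors, labels):
    """Training: compute one centroid (component-wise mean) per class."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    return {
        label: [sum(component) / len(members) for component in zip(*members)]
        for label, members in by_class.items()
    }


def classify_rocchio(centroids, vec):
    """Assignment: return the class whose centroid is closest to vec.

    Uses Euclidean distance (an assumption; the notes do not name a measure).
    """
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy 2-D document vectors (made-up numbers, not real term weights).
train_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["spam", "spam", "not spam", "not spam"]

centroids = train_rocchio(train_vectors, train_labels)
print(classify_rocchio(centroids, [0.9, 0.3]))
print(classify_rocchio(centroids, [0.1, 0.7]))
```

Training touches each labeled document once; assignment only compares a new document against one centroid per class, which is why the two steps can stay separate.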

Footnotes

  1. no free lunch