CSCE 470 Lecture 19
Monday, October 7, 2013
Overview
Last Week: Clustering
- unsupervised learning
- input: docs, maybe k
- output: clusters of documents
- methods:
  - k-means
  - HAC
This Week: Classification
- supervised learning
- input: docs, classes
- output: mapping of documents to classes
- methods:
  - Rocchio
  - kNN
  - Naïve Bayes
  - SVM
Classification
Document Space
- emails
- web pages
- tweets
- facebook profiles (of my "friends")
- etc.
Classes
- spam, not spam
- mentions, no mentions
- college friends, high school friends, olds, everyone else
For now, we will assume classes are mutually exclusive and collectively exhaustive (i.e. they partition the document space).
Goal: a classifier that can map the document space to the classes. In not so many words, classifier: docs → classes.
Approaches
Learning the classifier:
1. Eyeballs (manually)
2. Figure out some "rules" (college friend if your.college = my.college and |your.grad_year − my.grad_year| ≤ 4)
3. Machine Learning
Neither 1 nor 2 is very useful (i.e. they break and/or don't scale well), so we will focus on 3!
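The hand-written "college friend" rule above can be sketched in a few lines of Python. This is only an illustration: the `Profile` type, the `max_gap` parameter, and the sample profiles are made up for the example and are not from the lecture.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    # Hypothetical fields mirroring your.college / your.grad_year in the rule.
    college: str
    grad_year: int


def is_college_friend(you: Profile, me: Profile, max_gap: int = 4) -> bool:
    """Same college, and graduation years at most max_gap apart."""
    return you.college == me.college and abs(you.grad_year - me.grad_year) <= max_gap


# Made-up profiles, for illustration only.
me = Profile("State U", 2014)
print(is_college_friend(Profile("State U", 2011), me))  # same college, 3 years apart
print(is_college_friend(Profile("State U", 2005), me))  # same college, 9 years apart
print(is_college_friend(Profile("Other U", 2014), me))  # different college
```

Every new class needs another rule like this one, written and maintained by hand, which is one way such rules end up breaking or not scaling.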
Machine Learning
- Learn the classifier: analyze training data (examples of docs → classes) [1]
- Testing/Application: use the classifier on new stuff.
Rocchio
Vector space classification.
Two assumptions:
- Docs from the same class "bunch" in a contiguous region of space
- Classes are non-overlapping (i.e. drawing a convex blob around one class's points does not include any points from other classes)
Compute centroids for each class in training data.
Assign new documents to closest class centroid.
(For now, this assignment step is completely separate from the training step.)
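The two steps above (compute one centroid per class, then assign a new document to the closest centroid) can be sketched as follows. This is a minimal sketch under stated assumptions: documents are already represented as equal-length numeric vectors, "closest" is taken to mean Euclidean distance (the notes do not say which distance is used), and the function names and toy vectors are invented for the example.

```python
import math
from collections import defaultdict


def train_rocchio(vectors, labels):
    """Training step: return {class: centroid}, the mean vector of each class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate a component-wise sum of this class's vectors.
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify_rocchio(centroids, vec):
    """Assignment step: pick the class whose centroid is nearest (Euclidean, an assumption)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy 2-dimensional "documents", made up for illustration.
train_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["spam", "spam", "not spam", "not spam"]

centroids = train_rocchio(train_vectors, train_labels)
print(classify_rocchio(centroids, [0.7, 0.3]))
print(classify_rocchio(centroids, [0.1, 0.6]))
```

Keeping `train_rocchio` and `classify_rocchio` as separate functions mirrors the note that assignment is separate from training: the centroids are computed once and then reused for every new document.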
Footnotes
- [1] no free lunch