CSCE 470 Lecture 19
Monday, October 7, 2013
Last Week: Clustering
- unsupervised learning
- input: docs, maybe k
- output: clusters of documents
- methods
- k-means
This Week: Classification
- supervised learning
- input: docs, classes
- output: mapping of documents to classes
- methods:
- Rocchio
- kNN
- Naïve Bayes
Document Space
- emails
- web pages
- tweets
- facebook profiles (of my "friends")
- etc.
Classes
- spam, not spam
- mentions, no mentions
- college friends, high school friends, olds, everyone else
For now, we will assume classes are mutually exclusive and collectively exhaustive (i.e. they partition the document space).
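The partition assumption can be checked mechanically on a labeled collection. The sketch below is only an illustration: the function name `is_partition`, the document ids, and the dictionary layout are assumptions made for the example, not anything defined in the lecture.

```python
def is_partition(doc_ids, class_members):
    """Return True if the classes partition the document space.

    doc_ids: iterable of all document ids (hypothetical ids for illustration).
    class_members: dict mapping class name -> set of document ids.
    """
    docs = set(doc_ids)
    seen = set()
    for members in class_members.values():
        if members & seen:       # some document sits in two classes: not exclusive
            return False
        seen |= members
    return seen == docs          # every document is covered, and nothing extra


emails = {"e1", "e2", "e3"}
print(is_partition(emails, {"spam": {"e1"}, "not spam": {"e2", "e3"}}))        # True
print(is_partition(emails, {"spam": {"e1"}, "not spam": {"e3"}}))              # False: e2 has no class
print(is_partition(emails, {"spam": {"e1", "e2"}, "not spam": {"e2", "e3"}}))  # False: e2 is in both
```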
Goal: a classifier that can map each document in the document space to a class. In not so many words, docs → classes.
Learning the classifier
1. Eyeballs (manually)
2. Figure out some "rules" (college friend if = and |your.grad_year − my.grad_year| ≤ 4)
3. Machine Learning
Neither 1 nor 2 is very useful (i.e. they break and/or don't scale well), so we will be focusing on 3!
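To see why hand-written rules tend to break, here is a minimal sketch of the "college friend" rule. The first condition of the rule is incomplete in these notes, so the same-college test below is an assumption, as are the `Profile` class, the `college` field, and the placeholder college names; only `grad_year` and the 4-year window come from the notes.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    college: str      # hypothetical field; the notes do not name it
    grad_year: int


def is_college_friend(me, you, max_gap=4):
    """Hand-written rule: same college (assumed) and graduation years within max_gap."""
    return you.college == me.college and abs(you.grad_year - me.grad_year) <= max_gap


me = Profile("College A", 2014)
print(is_college_friend(me, Profile("College A", 2012)))  # True
print(is_college_friend(me, Profile("College A", 2008)))  # False: 6 years apart
print(is_college_friend(me, Profile("College B", 2014)))  # False: different college
```

A friend who transferred, or who graduated five years earlier, is misclassified, and every new class needs another rule written and maintained by hand.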
Machine Learning
- Learning the classifier: analyze training data (examples of docs → classes) [1]
- Testing/Application: use the learned classifier on new documents.
Vector space classification.
Two assumptions:
- Docs from the same class "bunch" in a contiguous region of space
- Classes are non-overlapping (i.e. drawing a convex blob around points does not include any points from other classes)
Compute centroids for each class in training data.
Assign new documents to closest class centroid.
(For now, the assignment step is completely separate from the training step.)
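The two steps above (compute one centroid per class, then assign a new document to the closest centroid) are the centroid-based scheme usually called Rocchio classification. A minimal sketch follows, under several assumptions: documents are already represented as equal-length numeric vectors, "closest" means Euclidean distance (the notes do not say which distance is meant), and the function names and toy 2-D data are made up for illustration. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """Training step. examples: list of (vector, label) pairs.

    Returns a dict label -> centroid, the component-wise mean of that class's vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(centroids, vec):
    """Assignment step: pick the class whose centroid is nearest (Euclidean, assumed)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy training data: hypothetical 2-D document vectors.
training = [
    ([1.0, 0.0], "spam"),
    ([3.0, 0.0], "spam"),
    ([0.0, 2.0], "not spam"),
    ([0.0, 4.0], "not spam"),
]
centroids = train_centroids(training)
print(centroids)                         # {'spam': [2.0, 0.0], 'not spam': [0.0, 3.0]}
print(classify(centroids, [2.5, 0.5]))   # spam
```

Keeping `train_centroids` and `classify` as separate functions mirrors the note above: training produces the centroids once, and assignment only reads them.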
[1] No free lunch