CSCE 470 Lecture 19
Monday, October 7, 2013
Overview
Last Week: Clustering
- unsupervised learning
- input: docs, maybe k
- output: clusters of documents
- methods
- k-means
- HAC
This Week: Classification
- supervised learning
- input: docs, classes
- output: mapping of documents to classes
- methods:
- Rocchio
- kNN
- Naïve Bayes
- SVM
Classification
Document Space $X$
- emails
- web pages
- tweets
- facebook profiles (of my "friends")
- etc.
Classes $C$
- spam, not spam
- mentions, no mentions
- college friends, high school friends, olds, everyone else
For now, we will assume classes are mutually exclusive and collectively exhaustive (i.e. they partition the document space).
Goal: a classifier $\gamma$ that can map $X$ to $C$. In not so many words, $\gamma : X \to C$.
Approaches
Learning $\gamma$, the classifier
1. Eyeballs (manually)
2. Figure out some "rules" (college friend if your.college = my.college and |your.grad_year − my.grad_year| ≤ 4)
3. Machine Learning
Neither 1 nor 2 is very useful (i.e. they break and/or don't scale well), so we will be focusing on 3!
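To make the hand-written "rules" approach concrete, here is a minimal sketch of the college-friend rule quoted above. The `Profile` type, the `is_college_friend` name, and the `max_gap` parameter are made up for illustration; only the rule itself (same college, graduation years at most 4 apart) comes from the notes.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    # Hypothetical profile record; only the two fields the rule needs.
    college: str
    grad_year: int


def is_college_friend(me: Profile, you: Profile, max_gap: int = 4) -> bool:
    """Hand-written rule: same college and graduation years within max_gap."""
    same_college = you.college == me.college
    close_years = abs(you.grad_year - me.grad_year) <= max_gap
    return same_college and close_years


me = Profile(college="College A", grad_year=2014)
print(is_college_friend(me, Profile("College A", 2011)))  # True
print(is_college_friend(me, Profile("College A", 2005)))  # False
print(is_college_friend(me, Profile("College B", 2014)))  # False
```

Every new class (high school friends, olds, ...) would need another hand-tuned rule like this one, which is the sense in which the approach does not scale.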
Machine Learning
- Learn the classifier ($\gamma$): analyze training data (examples of docs → classes) [1]
- Testing/Application: use $\gamma$ on new stuff.
Rocchio
Vector space classification.
Two assumptions:
- Docs from the same class "bunch" in a contiguous region of space
- Classes are non-overlapping (i.e. drawing a convex blob around one class's points does not include any points from other classes)
Compute centroids for each class in training data.
Assign new documents to closest class centroid.
(For now, this assignment step is completely separate from the training step.)
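The two Rocchio steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the lecture's implementation: documents are assumed to already be fixed-length numeric vectors, "closest" is taken to mean Euclidean distance, and the names `train_rocchio` and `apply_rocchio` plus the toy data are invented for the example.

```python
import math
from collections import defaultdict


def train_rocchio(docs, labels):
    """Training step: return {class: centroid}, the mean vector of each class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(docs, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        # Accumulate a component-wise sum per class.
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def apply_rocchio(centroids, vec):
    """Assignment step: pick the class whose centroid is nearest (Euclidean, assumed)."""
    return min(centroids, key=lambda c: math.dist(centroids[c], vec))


# Toy 2-D "document vectors" (made-up data).
train_docs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["spam", "spam", "not spam", "not spam"]

centroids = train_rocchio(train_docs, train_labels)
print(apply_rocchio(centroids, [0.9, 0.1]))  # spam
```

Training touches only the labeled data and produces one centroid per class; assignment needs only those centroids, which is why the two steps can be kept separate.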
Footnotes
- [1] no free lunch