CSCE 470 Lecture 20


Wednesday, October 9, 2013


Quiz on Tuesday; likely over PageRank.

HW5 out tonight.


Classification

What? Assigning items (documents) to categories (classes).

Why? Spam filtering, enhanced search results, topics of interest to me.

How? Assumption: we have training data (pre-classified examples).


Setup Phase

Both clustering and classification start by representing each document as a vector. How are these vectors obtained? TF-IDF weights, other "features", and so on. We can choose different representations.
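To make the setup phase concrete, here is a minimal sketch of one way to build TF-IDF document vectors. The function name `tfidf_vectors`, the toy documents, and the weighting (raw term count times log(N / df)) are assumptions chosen for illustration; the lecture does not fix a particular variant.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary.

    Hypothetical helper: uses raw term counts and idf = log(N / df),
    which is only one of several common TF-IDF variants.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Toy corpus (made up for this example).
docs = [
    "cheap pills cheap offer".split(),
    "meeting agenda for tuesday".split(),
    "cheap meeting offer".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes a list of weights aligned with `vocab`, so any pair of documents can be compared with a distance or similarity measure.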


Classifiers

  1. Rocchio:
    • Learn: find centroids of each class in training data
    • Apply: assign new documents to nearest centroid's class
    • Fail: "multi-modal" or overlapping classes (the decision boundary is linear)
  2. K-Nearest Neighbors (KNN):
    • Learn: just keep the pre-classified data; don't do anything with it (nada)
    • Apply: find the k nearest neighbors and assign the class that shows up the most. What is the best k?
    • Pass: allows for "pockets" (non-linear decision boundary)
  3. Naïve Bayes
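
The learn and apply steps listed for Rocchio and KNN can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the use of Euclidean distance, the default `k=3`, and the toy 2-D data are all assumptions. The data puts class "A" in two separate pockets with class "B" in between, which is the multi-modal case where Rocchio is expected to struggle.

```python
import math
from collections import Counter, defaultdict


def distance(a, b):
    """Euclidean distance (an assumption; cosine similarity is also common)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rocchio_learn(vectors, labels):
    """Learn: compute one centroid (component-wise mean) per class."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def rocchio_apply(centroids, vec):
    """Apply: assign the class of the nearest centroid."""
    return min(centroids, key=lambda label: distance(centroids[label], vec))


def knn_apply(vectors, labels, vec, k=3):
    """Apply: majority vote among the k nearest training points.

    There is no learn step; the training data is used as-is.
    """
    nearest = sorted(zip(vectors, labels),
                     key=lambda pair: distance(pair[0], vec))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training data: class "A" has two pockets, class "B" sits between them.
train = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 0), (6, 0), (5, 1)]
labels = ["A", "A", "A", "A", "B", "B", "B"]

centroids = rocchio_learn(train, labels)
new_doc = (0.5, 0)  # lies inside the left "A" pocket
rocchio_label = rocchio_apply(centroids, new_doc)
knn_label = knn_apply(train, labels, new_doc, k=3)
```

On this data the centroid of "A" falls between its two pockets, so Rocchio labels `new_doc` as "B", while KNN with `k=3` follows the local neighborhood and labels it "A".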