CSCE 470 Lecture 36

Friday, November 22, 2013


We need to rename our bundle namespace! The pattern is:

"company" \ "product" + Bundle


???\Wikipedia\CategorizerBundle

Final Review

"term-document" matrix

  • sparse 1s in a sea of 0s
  • inverted index
  • indexing pipeline → magic!
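A minimal sketch of the first two bullets, assuming whitespace tokenization and a made-up three-document collection (the function name `build_inverted_index` and the sample documents are illustrative, not from the lecture). Rather than storing the mostly-zero term-document matrix, the inverted index keeps, for each term, only the sorted list of documents where the matrix entry would be 1.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it.

    docs: dict of doc_id -> text. Tokenization is a plain lowercase
    whitespace split, which is an assumption made for this sketch.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted postings lists make Boolean AND a simple list intersection.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample collection.
docs = {
    1: "new home sales",
    2: "home sales rise",
    3: "rise in new prices",
}
index = build_inverted_index(docs)

# Boolean query "home AND sales": intersect the two postings lists.
home_and_sales = sorted(set(index["home"]) & set(index["sales"]))
```

Here `index["new"]` holds documents 1 and 3, and the AND query returns the documents that appear in both postings lists.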


Foundations

  • Building an index
  • Statistical properties of text (Zipf and Heaps)
  • Evaluation (Precision, Recall, F-Measure, NDCG)
  • MapReduce
  • Interfaces
  • Bow-tie structure of the web

Retrieval Models

  • Boolean
  • Vector Space (cosine, TF-IDF)
  • Link Analysis (PageRank and HITS)
  • Learning to Rank


IR in Action

  • Recommenders (collab filtering, content-based)
  • Clustering (K-Means, HAC) and Classification (Rocchio, KNN, Naïve Bayes)
  • Geo + Location-based
  • Question answering
  • Privacy

The Final

  • Everything is fair game (cumulative)
  • Like in-class quizzes
  • 90 minutes available (written as a "75-minute exam")

Logistics

  • 20% of grade
  • Monday
  • Regular Class Time
  • HRBB 131
  • Closed Book
  • Two Pages of notes, formulas, etc.
  • No calculators


Three Types of Questions:

  • Short Answer
  • Concept application (walk through the algorithm)
    • K-Means clustering
    • Naïve Bayes Classification
    • Collaborative Filtering
    • ...
  • Synthesis


Practice

Do the posted previous finals


Hub
    a page that links to pages that directly answer the information need
Authority
    a page that directly answers an information need
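The two definitions feed each other in HITS: a good hub links to good authorities, and a good authority is linked to by good hubs. A rough sketch of that iteration, assuming a tiny hand-made link graph (the page names, the `hits` function, and the fixed iteration count are all illustrative choices, not from the lecture):

```python
import math


def hits(graph, iterations=20):
    """Iteratively compute hub and authority scores.

    graph: dict of page -> list of pages it links to.
    Returns (hub, auth), each a dict of page -> score.
    """
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[dst] for dst in graph.get(n, ())) for n in nodes}

        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}

    return hub, auth


# Hypothetical graph: "portal" and "blog" link out, "a" and "b" are answers.
graph = {"portal": ["a", "b"], "blog": ["a"], "a": [], "b": []}
hub, auth = hits(graph)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

In this toy graph "portal" ends up as the strongest hub because it links to both answer pages, and "a" ends up as the strongest authority because both hubs point to it.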

Naïve Bayes

Three classes:

  1. Funny
  2. Rants
  3. Crap


Precalculated probabilities:

            Funny   Rants   Crap
P(c)        0.1     0.2     0.7
P(…|c)      0.05    0.8     0.1
P(…|c)      0.3     0.4     0.6
P(…|c)      0.05    0.01    0.5
P(…|c)      0.3     0.02    0.01
P(…|c)      0.5     0.01    0.2

Ignore words that are not in the table.

Options for handling a word that is not in the table:

  • ignore
  • smoothing: give unknown words probability 0.0001 for all categories
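A minimal sketch of the classification step, assuming the priors and the five conditional-probability rows from the table above. The words those rows belong to are not available here, so the names `w1` through `w5` are placeholders, and `classify` and its `unknown_prob` parameter are illustrative. Scores are summed in log space to avoid underflow from multiplying many small probabilities.

```python
import math

CLASSES = ["Funny", "Rants", "Crap"]

# Class priors P(c), from the table.
PRIOR = {"Funny": 0.1, "Rants": 0.2, "Crap": 0.7}

# P(word | c). The keys w1..w5 are PLACEHOLDER names (assumption):
# only the numeric rows of the table are known.
LIKELIHOOD = {
    "w1": {"Funny": 0.05, "Rants": 0.8, "Crap": 0.1},
    "w2": {"Funny": 0.3, "Rants": 0.4, "Crap": 0.6},
    "w3": {"Funny": 0.05, "Rants": 0.01, "Crap": 0.5},
    "w4": {"Funny": 0.3, "Rants": 0.02, "Crap": 0.01},
    "w5": {"Funny": 0.5, "Rants": 0.01, "Crap": 0.2},
}


def classify(words, unknown_prob=None):
    """Return (best_class, log_scores) for a list of words.

    unknown_prob=None   -> ignore words that are not in the table.
    unknown_prob=0.0001 -> the smoothing option: every class gets the
                           same small probability for an unknown word.
    """
    scores = {}
    for c in CLASSES:
        score = math.log(PRIOR[c])
        for w in words:
            if w in LIKELIHOOD:
                score += math.log(LIKELIHOOD[w][c])
            elif unknown_prob is not None:
                score += math.log(unknown_prob)
        scores[c] = score
    return max(scores, key=scores.get), scores


# "zzz" is an unknown word; try both options for handling it.
label_ignore, _ = classify(["w4", "w5", "zzz"])
label_smooth, _ = classify(["w4", "w5", "zzz"], unknown_prob=0.0001)
```

For the document `w4 w5`, the unnormalized scores are 0.1 × 0.3 × 0.5 = 0.015 for Funny, 0.2 × 0.02 × 0.01 = 0.00004 for Rants, and 0.7 × 0.01 × 0.2 = 0.0014 for Crap, so Funny wins. Because the smoothing option gives an unknown word the same probability in every class, it scales all three scores by the same factor and picks the same class as ignoring the word.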