CSCE 470 Lecture 5

From Notes
Jump to navigation Jump to search

« previous | Wednesday, September 4, 2013 | next »


Ranking Results

Query: "coach sumlin is nice"

d1: coach brown is really nice
d2: coach brown is really nice coach brown is really nice
d3: coach sumlin walks across the field and says hello
d4: sumlin
d5: is

Goal: a scoring function , with

s :: Query -> Document -> Number

D3 satisfies our underlying need

Jaccard Index

Consider the query and documents as sets of words:

  • (same set representation as d1)

Oops. Best document is now rated last. What about weighting words?

Term Frequency - Inverse Document Frequency (TF-IDF)

If a word appears more often in document set, then it should be treated with less value

Measure of "informativeness"

  • define as count of documents that contain term
  • first attempt:
  • inverse document frequency:
  • term frequency (): number of occurrences of a term in a single document .
    • On a similar note, collection frequency measures number of occurrences across all documents (i.e. )

Ideal case: low IDF (rare words) and high TF

TF-IDF term score:


Sublinear Scaling

If document has a term 50 times and document only has a term 10 times, is really 5 times better than ?

We can use an alternative formula for term frequency that prevents documents with many occurrences of terms from running away:

And we modify our TF-IDF score to use this new formula:

Relation to Vector Space Model

  1. How to represent a doc? (TF-IDF vector)
  2. How to measure similarity of query,doc pairs? (cosine)


TF-IDF Vector

Vector space in which each axis corresponds to a term in the entire collection.

Each component has TF-IDF value

documents             # collection of all documents
index = {...}         # inverse index / postings of terms to documents
terms = index.keys()  # set of all terms across all documents


def tf(t,d):
   return tokenize(d).count(t)

def idf(t):
   return len(index[t])

def tf_idf(t,d):
   return tf(t,d) * idf(t)


for d in documents:
   vectors[d.id] = map(lambda t: tf_idf(t,d), terms)