CSCE 470 Lecture 5
Wednesday, September 4, 2013
Ranking Results
Query: "coach sumlin is nice"
- d1: coach brown is really nice
- d2: coach brown is really nice coach brown is really nice
- d3: coach sumlin walks across the field and says hello
- d4: sumlin
- d5: is
Goal: a scoring function $s(q, d)$, with
s :: Query -> Document -> Number
Intuitively, $d_3$ best satisfies our underlying information need.
Jaccard Index
Consider the query and documents as sets of words and score each pair with the Jaccard index $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$:

- $J(q, d_1) = \frac{|\{\text{coach, is, nice}\}|}{|\{\text{coach, sumlin, brown, is, really, nice}\}|} = \frac{3}{6} = \frac{1}{2}$
- $J(q, d_2) = \frac{1}{2}$ (same set representation as $d_1$)
- $J(q, d_3) = \frac{2}{11}$
- $J(q, d_4) = J(q, d_5) = \frac{1}{4}$
Oops. Best document is now rated last. What about weighting words?
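A minimal sketch of this computation (assuming simple whitespace tokenization; the jaccard helper is illustrative, not from the lecture):

def jaccard(a, b):
    # Jaccard index: |intersection| / |union| of the two word sets
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

query = "coach sumlin is nice"
docs = ["coach brown is really nice",
        "coach brown is really nice coach brown is really nice",
        "coach sumlin walks across the field and says hello",
        "sumlin",
        "is"]
print([round(jaccard(query, d), 2) for d in docs])
# [0.5, 0.5, 0.18, 0.25, 0.25] -- d3 scores lowest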
Term Frequency - Inverse Document Frequency (TF-IDF)
If a word appears in more documents across the collection, then it should be treated with less value
Measure of "informativeness"
- Define document frequency $\mathrm{df}_t$ as the count of documents that contain term $t$.
- First attempt: weight a term by $\frac{1}{\mathrm{df}_t}$; better is the inverse document frequency $\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$, where $N$ is the total number of documents.
- Term frequency ($\mathrm{tf}_{t,d}$): number of occurrences of a term $t$ in a single document $d$.
- On a similar note, collection frequency measures the number of occurrences across all documents (i.e. $\mathrm{cf}_t = \sum_d \mathrm{tf}_{t,d}$); a sketch of the distinction follows this list.
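To make the df/cf distinction concrete, here is a small sketch (assuming a documents collection and a tokenize helper like those in the code at the end of this section):

def df(t):
    # document frequency: number of documents containing t at least once
    return sum(1 for d in documents if t in tokenize(d))

def cf(t):
    # collection frequency: total occurrences of t across all documents
    return sum(tokenize(d).count(t) for d in documents)

For the example collection, df("is") = 3 but cf("is") = 4, since d2 contains "is" twice.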
Ideal case: high IDF (i.e. rare words, low df) and high TF
TF-IDF term score:
$$w_{t,d} = \mathrm{tf}_{t,d} \times \log_{10} \frac{N}{\mathrm{df}_t}$$
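For instance, in the example collection above ($N = 5$): $\mathrm{df}_{\text{sumlin}} = 2$, so $\mathrm{idf}_{\text{sumlin}} = \log_{10} \frac{5}{2} \approx 0.40$, while $\mathrm{df}_{\text{is}} = 3$ gives $\mathrm{idf}_{\text{is}} = \log_{10} \frac{5}{3} \approx 0.22$; the rarer "sumlin" carries roughly twice the weight per occurrence.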
Sublinear Scaling
If document $d_1$ contains a term 50 times and document $d_2$ contains it only 10 times, is $d_1$ really 5 times better than $d_2$?
We can use an alternative formula for term frequency that prevents documents with many occurrences of a term from running away:
$$\mathrm{wf}_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
And we modify our TF-IDF score to use this new formula:
$$w_{t,d} = \mathrm{wf}_{t,d} \times \log_{10} \frac{N}{\mathrm{df}_t}$$
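Plugging in the numbers from the question above: $1 + \log_{10} 50 \approx 2.70$ versus $1 + \log_{10} 10 = 2$, so the 50-occurrence document now scores only about 1.35 times higher instead of 5 times. A standalone sketch of the scaled weight (names are illustrative):

from math import log10

def wf(tf):
    # sublinear term-frequency scaling: 1 + log10(tf) for tf > 0, else 0
    return 1 + log10(tf) if tf > 0 else 0

print(wf(50) / wf(10))  # ~1.35, instead of the raw ratio of 5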
Relation to Vector Space Model
- How to represent a doc? (TF-IDF vector)
- How to measure similarity of query/document pairs? (cosine; see the sketch after the code below)
TF-IDF Vector
Vector space in which each axis corresponds to a term in the entire collection.
Each component of a document's vector holds the TF-IDF value of the corresponding term.
from math import log10

documents = [...]     # collection of all documents
index = {...}         # inverted index / postings mapping each term to the documents containing it
terms = index.keys()  # set of all terms across all documents
N = len(documents)    # total number of documents

def tf(t, d):
    # raw term frequency: occurrences of t in d (tokenize assumed to split text into words)
    return tokenize(d).count(t)

def idf(t):
    # inverse document frequency: log10(N / df_t), not the raw posting count
    return log10(N / len(index[t]))

def tf_idf(t, d):
    return tf(t, d) * idf(t)

vectors = {}
for d in documents:
    vectors[d.id] = [tf_idf(t, d) for t in terms]
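To rank documents against a query with the cosine measure mentioned above, one possible sketch (building on the vectors, terms, and tf_idf defined here; treating the query as a short document is an assumption, not from the lecture):

from math import sqrt

def cosine(u, v):
    # cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score(q, d):
    q_vec = [tf_idf(t, q) for t in terms]  # build a TF-IDF vector for the query
    return cosine(q_vec, vectors[d.id])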