CSCE 470 Lecture 5

Wednesday, September 4, 2013

Ranking Results

Query: "coach sumlin is nice"

d1: coach brown is really nice
d2: coach brown is really nice coach brown is really nice
d3: coach sumlin walks across the field and says hello
d4: sumlin
d5: is

Goal: a scoring function $s$, with

s :: Query -> Document -> Number

Of these, d3 is the document that satisfies our underlying information need, so $s$ should score it highest.

Jaccard Index

$$J(A,B)={\frac {|A\cap B|}{|A\cup B|}}$$

Consider the query and documents as sets of words:

$J(q,d_{1})={\tfrac {3}{6}}$
$J(q,d_{2})={\tfrac {3}{6}}$ (same set representation as d1)
$J(q,d_{3})={\tfrac {2}{11}}$
$J(q,d_{4})={\tfrac {1}{4}}$
$J(q,d_{5})={\tfrac {1}{4}}$
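The scores above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and exact word matching; the names `jaccard`, `docs`, and `scores` are illustrative and not part of the lecture. `Fraction` is used so results print as exact ratios (note that it reduces 3/6 to 1/2).

```python
from fractions import Fraction

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two word collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return Fraction(0)  # convention chosen here for two empty sets
    return Fraction(len(a & b), len(a | b))

query = "coach sumlin is nice".split()
docs = {
    "d1": "coach brown is really nice",
    "d2": "coach brown is really nice coach brown is really nice",
    "d3": "coach sumlin walks across the field and says hello",
    "d4": "sumlin",
    "d5": "is",
}

# Score every document against the query.
scores = {name: jaccard(query, text.split()) for name, text in docs.items()}
for name, score in scores.items():
    print(name, score)
```

Because both arguments are converted to sets, repeating words (as in d2) does not change the score.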

Oops: the best document, d3, is now ranked last. What about weighting words?

Term Frequency - Inverse Document Frequency (TF-IDF)

If a word appears in many documents of the collection, it tells us little about any one of them and should be given less weight.

Measure of "informativeness"

define $\mathrm {df} (t)$ as count of documents that contain term $t$
first attempt: $1-{\frac {\mathrm {df} (t)}{n}}$
inverse document frequency: $\mathrm {idf} (t)=\log {\left({\frac {n}{\mathrm {df} (t)}}\right)}$
term frequency ( $\mathrm {tf} (t,d)$ ): number of occurrences of a term $t$ in a single document $d$.
- On a similar note, collection frequency measures the number of occurrences across all documents (i.e. $\mathrm {cf} (t)=\sum _{d\in D}\mathrm {tf} (t,d)$ )
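To make the definitions above concrete, here is a small sketch that computes them for the five example documents. It assumes whitespace tokenization and the natural logarithm (the notes do not fix a log base), and `idf` assumes the term occurs in at least one document. The helper names are illustrative.

```python
import math

docs = {
    "d1": "coach brown is really nice",
    "d2": "coach brown is really nice coach brown is really nice",
    "d3": "coach sumlin walks across the field and says hello",
    "d4": "sumlin",
    "d5": "is",
}
tokens = {name: text.split() for name, text in docs.items()}
n = len(tokens)  # number of documents in the collection

def tf(t, d):
    """Term frequency: occurrences of term t in document d."""
    return tokens[d].count(t)

def df(t):
    """Document frequency: number of documents containing t at least once."""
    return sum(1 for words in tokens.values() if t in words)

def idf(t):
    """Inverse document frequency, log(n / df(t)); assumes df(t) > 0."""
    return math.log(n / df(t))

def cf(t):
    """Collection frequency: occurrences of t summed over all documents."""
    return sum(tf(t, d) for d in tokens)

for t in ["coach", "sumlin", "is", "nice"]:
    print(t, df(t), cf(t), round(idf(t), 3))
```

For instance, "coach" appears in three documents (d1, d2, d3) but four times in total, so its document frequency and collection frequency differ.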

Ideal case: high IDF (rare words, i.e. low document frequency) and high TF

TF-IDF term score:

$$\mathrm {tfidf} (t,d)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t)$$
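As a worked illustration of the term score, the sketch below evaluates it for a few (term, document) pairs from the example collection. It assumes whitespace tokenization and the natural logarithm, and it returns 0 for a term that occurs nowhere in the collection (a choice made here, not something the notes specify).

```python
import math

docs = {
    "d1": "coach brown is really nice",
    "d2": "coach brown is really nice coach brown is really nice",
    "d3": "coach sumlin walks across the field and says hello",
    "d4": "sumlin",
    "d5": "is",
}
tokens = {name: text.split() for name, text in docs.items()}
n = len(tokens)

def tfidf(t, d):
    """tf(t, d) * idf(t), with idf(t) = log(n / df(t))."""
    tf = tokens[d].count(t)
    df = sum(1 for words in tokens.values() if t in words)
    if df == 0:
        return 0.0  # term absent from the whole collection (assumed convention)
    return tf * math.log(n / df)

for t, d in [("sumlin", "d3"), ("coach", "d3"), ("nice", "d3"),
             ("nice", "d1"), ("nice", "d2")]:
    print(t, d, round(tfidf(t, d), 3))
```

Within d3, "sumlin" receives a higher weight than "coach" because it occurs in fewer documents. The score for "nice" in d2 is exactly twice that in d1, since d2 simply repeats d1's text.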

Sublinear Scaling

If document $A$ contains a term $t$ 50 times and document $B$ contains it only 10 times, is $A$ really 5 times better than $B$ ?

We can use an alternative formula for term frequency that prevents documents with many occurrences of terms from running away:

$$\mathrm {tf} '(t,d)={\begin{cases}1+\log {\left(\mathrm {tf} (t,d)\right)}&\mathrm {tf} (t,d)>0\\0&{\mbox{otherwise}}\end{cases}}$$

And we modify our TF-IDF score to use this new formula:

$$\mathrm {tfidf} '(t,d)=\mathrm {tf} '(t,d)\cdot \mathrm {idf} (t)$$
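The effect of the scaled term frequency on the 50-versus-10 example can be checked numerically. This sketch assumes the natural logarithm; the function name `sublinear_tf` is illustrative.

```python
import math

def sublinear_tf(raw_tf):
    """Scaled term frequency: 1 + log(tf) when tf > 0, otherwise 0."""
    return 1 + math.log(raw_tf) if raw_tf > 0 else 0.0

a = sublinear_tf(50)  # document A: term occurs 50 times
b = sublinear_tf(10)  # document B: term occurs 10 times
print(a, b, a / b)
```

With raw counts, A outweighs B by a factor of 5; after scaling, the ratio drops to roughly 1.5. A single occurrence still scores 1, and an absent term scores 0.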

Relation to Vector Space Model

How to represent a document? (TF-IDF vector)
How to measure similarity of query-document pairs? (cosine)
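For reference, cosine similarity between two equal-length vectors can be sketched as follows. The function name `cosine` and the choice to return 0 when either vector is all zeros are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v (equal-length sequences)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; 0 is a convention chosen here
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction, different length
print(cosine([1, 0], [0, 1]))        # no shared terms
```

Because the measure depends only on direction, a document whose vector is a scaled copy of another (such as one that repeats its text twice) is treated as identical to it.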

TF-IDF Vector

Vector space in which each axis corresponds to a term in the entire collection.

Each component of a document's vector is the TF-IDF value of the corresponding term in that document.

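import math

documents             # collection of all documents (assumed to be defined elsewhere)
index = {...}         # inverted index: postings from each term to the documents containing it
terms = sorted(index.keys())  # all terms across all documents, in a fixed order (one per axis)
n = len(documents)    # number of documents in the collection


def tf(t, d):
    # number of occurrences of term t in document d
    # (tokenize is assumed to return the document's list of terms)
    return tokenize(d).count(t)

def idf(t):
    # log(n / df(t)); df(t) is the length of t's postings list,
    # assuming one posting per document that contains t
    return math.log(n / len(index[t]))

def tf_idf(t, d):
    return tf(t, d) * idf(t)


vectors = {}          # document id -> TF-IDF vector
for d in documents:
    vectors[d.id] = [tf_idf(t, d) for t in terms]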