CSCE 470 Lecture 5

From Notes
Jump to navigation Jump to search

« previous | Wednesday, September 4, 2013 | next »


Ranking Results

Query: "coach sumlin is nice"

d1: coach brown is really nice
d2: coach brown is really nice coach brown is really nice
d3: coach sumlin walks across the field and says hello
d4: sumlin
d5: is

Goal: a scoring function , with

s :: Query -> Document -> Number

D3 satisfies our underlying need

Jaccard Index

Consider the query and documents as sets of words:

  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_2) = \tfrac{3}{6}} (same set representation as d1)
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_3) = \tfrac{2}{11}}
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_4) = \tfrac{1}{4}}
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_5) = \tfrac{1}{4}}

Oops. Best document is now rated last. What about weighting words?

Term Frequency - Inverse Document Frequency (TF-IDF)

If a word appears more often in document set, then it should be treated with less value

Measure of "informativeness"

  • define Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{df}(t)} as count of documents that contain term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t}
  • first attempt: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 1 - \frac{\mathrm{df}(t)}{n}}
  • inverse document frequency: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{idf}(t) = \log{\left( \frac{n}{\mathrm{df}(t)} \right)}}
  • term frequency (Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{tf}(t,d)} ): number of occurrences of a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} in a single document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d} .
    • On a similar note, collection frequency measures number of occurrences across all documents (i.e. Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{cf}(t) = \sum_{d \in D} \mathrm{tf}(t,d)} )

Ideal case: low IDF (rare words) and high TF

TF-IDF term score:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)}


Sublinear Scaling

If document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} has a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} 50 times and document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} only has a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} 10 times, is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} really 5 times better than Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} ?

We can use an alternative formula for term frequency that prevents documents with many occurrences of terms from running away:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{tf}'(t,d) = \begin{cases} 1 + \log{ \left( \mathrm{tf}(t,d) \right) } & \mathrm{tf}(t,d) > 0 \\ 0 & \mbox{otherwise}\end{cases}}

And we modify our TF-IDF score to use this new formula:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{tfidf}'(t,d) = \mathrm{tf}'(t,d) \cdot \mathrm{idf}(t)}

Relation to Vector Space Model

  1. How to represent a doc? (TF-IDF vector)
  2. How to measure similarity of query,doc pairs? (cosine)


TF-IDF Vector

Vector space in which each axis corresponds to a term in the entire collection.

Each component has TF-IDF value

documents             # collection of all documents
index = {...}         # inverse index / postings of terms to documents
terms = index.keys()  # set of all terms across all documents


def tf(t,d):
   return tokenize(d).count(t)

def idf(t):
   return len(index[t])

def tf_idf(t,d):
   return tf(t,d) * idf(t)


for d in documents:
   vectors[d.id] = map(lambda t: tf_idf(t,d), terms)