CSCE 470 Lecture 5
« previous | Wednesday, September 4, 2013 | next »
Ranking Results
Query: "coach sumlin is nice"
d1: coach brown is really nice d2: coach brown is really nice coach brown is really nice d3: coach sumlin walks across the field and says hello d4: sumlin d5: is
Goal: a scoring function , with
s :: Query -> Document -> Number
D3 satisfies our underlying need
Jaccard Index
Consider the query and documents as sets of words:
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_2) = \tfrac{3}{6}} (same set representation as d1)
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_3) = \tfrac{2}{11}}
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_4) = \tfrac{1}{4}}
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle J(q,d_5) = \tfrac{1}{4}}
Oops. Best document is now rated last. What about weighting words?
Term Frequency - Inverse Document Frequency (TF-IDF)
If a word appears more often in document set, then it should be treated with less value
Measure of "informativeness"
- define Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{df}(t)} as count of documents that contain term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t}
first attempt: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 1 - \frac{\mathrm{df}(t)}{n}}- inverse document frequency: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{idf}(t) = \log{\left( \frac{n}{\mathrm{df}(t)} \right)}}
- term frequency (Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{tf}(t,d)}
): number of occurrences of a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t}
in a single document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d}
.
- On a similar note, collection frequency measures number of occurrences across all documents (i.e. Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{cf}(t) = \sum_{d \in D} \mathrm{tf}(t,d)} )
Ideal case: low IDF (rare words) and high TF
TF-IDF term score:
Sublinear Scaling
If document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} has a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} 50 times and document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} only has a term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} 10 times, is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} really 5 times better than Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} ?
We can use an alternative formula for term frequency that prevents documents with many occurrences of terms from running away:
And we modify our TF-IDF score to use this new formula:
Relation to Vector Space Model
- How to represent a doc? (TF-IDF vector)
- How to measure similarity of query,doc pairs? (cosine)
TF-IDF Vector
Vector space in which each axis corresponds to a term in the entire collection.
Each component has TF-IDF value
documents # collection of all documents
index = {...} # inverse index / postings of terms to documents
terms = index.keys() # set of all terms across all documents
def tf(t,d):
return tokenize(d).count(t)
def idf(t):
return len(index[t])
def tf_idf(t,d):
return tf(t,d) * idf(t)
for d in documents:
vectors[d.id] = map(lambda t: tf_idf(t,d), terms)