CSCE 470 Lecture 31
Monday, November 11, 2013
Learning to Rank
In the beginning, there was boolean retrieval: no ranking.
Ranked results:
- Jaccard (content-based)
- TF-IDF and cosine (content-based)
- Link analysis: PageRank and HITS (not content-based)
Example:
Query: "twerk team"
$$\begin{aligned} score(q, d_1) &= 0.5 \cdot \cos(q,d_1) + 0.5 \cdot PR(d_1) \\ score(q, d_2) &= 0.5 \cdot \cos(q,d_2) + 0.5 \cdot PR(d_2) \end{aligned}$$
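As a minimal sketch of this equal-weight combination: the cosine and PageRank values below are made up for illustration (the lecture gives none), and `combined_score` is a hypothetical helper name, not something from the course.

```python
def combined_score(cosine, pagerank, w_cos=0.5, w_pr=0.5):
    """Linear combination of a content score (cosine) and a link score (PageRank)."""
    return w_cos * cosine + w_pr * pagerank

# Illustrative, made-up feature values for two documents and one query.
docs = {
    "d1": {"cosine": 0.30, "pagerank": 0.10},
    "d2": {"cosine": 0.20, "pagerank": 0.40},
}

scores = {name: combined_score(f["cosine"], f["pagerank"]) for name, f in docs.items()}

# Rank documents by descending combined score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up numbers d2 outranks d1 even though d1 has the higher cosine, because the PageRank term carries half of the weight.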
Other factors we might care about:
- beauty (aesthetics)
- recency of update
- clicks in past day
- bytes
- load time
- query in title
Each of these could have its own parameter (a weight in the scoring function).
We could use machine learning to figure out the best values of these parameters.
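One way to write this down, assuming the combination stays linear as in the two-feature example (the symbols $f_i$, $w_i$, and $n$ are introduced here for illustration and are not from the lecture):

```latex
score(q, d) = \sum_{i=1}^{n} w_i \cdot f_i(q, d)
```

Here $f_i(q, d)$ is the value of the $i$-th feature for query $q$ and document $d$, and $w_i$ is its weight. The earlier example is the special case $n = 2$ with $f_1 = \cos(q, d)$, $f_2 = PR(d)$, and $w_1 = w_2 = 0.5$.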
Simple Example
Term proximity: $\omega$ is the minimum query window size, i.e. the size of the smallest window of consecutive words on the page that contains all of the query words.
- "penguin logo" with $\omega = 4$ could mean "penguin blah blah logo" appears on the page.
- Smaller window size is better
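A minimal sketch of how $\omega$ could be computed with a sliding window over the page's words. The function name `min_query_window` and the whitespace tokenization are assumptions made for this example; the lecture does not say how the window is computed.

```python
def min_query_window(doc_tokens, query_terms):
    """Return the size (in words) of the smallest window of doc_tokens that
    contains every query term, or None if some term never occurs."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = {}   # occurrences of each query term inside the current window
    covered = 0   # number of distinct query terms inside the current window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left for as long as the window covers every term.
        while covered == len(needed):
            size = right - left + 1
            if best is None or size < best:
                best = size
            left_token = doc_tokens[left]
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1
    return best

page = "the penguin blah blah logo is a penguin".split()
omega = min_query_window(page, ["penguin", "logo"])
print(omega)  # 4, the window "penguin blah blah logo"
```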
| Example | Doc ID | Query | Cosine | $\omega$ | Judgement |
|---|---|---|---|---|---|
| $\phi_1$ | 37 | linux operating system | 0.032 | 3 | relevant |
| $\phi_2$ | 37 | penguin logo | 0.02 | 4 | nonrelevant |
| $\phi_3$ | 238 | operating system | 0.043 | 2 | relevant |
| $\phi_4$ | 238 | runtime environment | 0.004 | 2 | nonrelevant |
We can treat each example as a point in a feature space over the cosine and $\omega$ features. With only these two features the space is two-dimensional; every additional feature adds a dimension.
In this (eventually high-dimensional) feature space, we learn a classifier that separates relevant from nonrelevant examples. This requires training data with relevance judgements.
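A minimal sketch of learning such a classifier from the four judged examples in the table. The lecture does not name a learning algorithm; a perceptron is swapped in here because it is simple, and the rescaling of the cosine feature is an assumption made so that both features have comparable magnitude. The names `features`, `train_perceptron`, and `predict` are hypothetical.

```python
# Training examples from the table: (cosine, omega, judgement).
examples = [
    (0.032, 3, "relevant"),
    (0.020, 4, "nonrelevant"),
    (0.043, 2, "relevant"),
    (0.004, 2, "nonrelevant"),
]

def features(cosine, omega):
    # Rescale cosine so it is on a similar scale to omega (an assumption
    # made for this sketch; the lecture does not discuss scaling).
    return (100.0 * cosine, float(omega))

def train_perceptron(data, max_epochs=10000):
    """Learn weights w and bias b so that w.x + b > 0 means 'relevant'."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for cosine, omega, label in data:
            x = features(cosine, omega)
            y = 1 if label == "relevant" else -1
            # Update only when the example is misclassified (or on the boundary).
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training example is on the correct side
    return w, b

def predict(w, b, cosine, omega):
    x = features(cosine, omega)
    return "relevant" if w[0] * x[0] + w[1] * x[1] + b > 0 else "nonrelevant"

w, b = train_perceptron(examples)
predictions = [predict(w, b, c, o) for c, o, _ in examples]
print(predictions)
```

These four points are linearly separable (a threshold on cosine alone already splits them), so the perceptron stops once it classifies all of them correctly. Four examples are far too few to say anything about unseen pages; the point is only the shape of the pipeline: features in, judgement out.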
In 2008, Google admitted to using over 200 features. Now they're using well over 1000.
- is tilde in url? — FEATURE!
- how many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
- does it have a picture of a cat on the page? — FEATURE!
Classifying results as relevant or nonrelevant is a high-level abstraction of boolean retrieval. The corresponding high-level abstraction of ranked retrieval will be discussed next time.