CSCE 470 Lecture 31
Monday, November 11, 2013
Learning to Rank
In the beginning, there was boolean retrieval: no ranking.
Ranked results:
- Jaccard (content-based)
- TF-IDF and cosine (content-based)
- Link analysis: PageRank and HITS (not content-based)
Example:
Query: "twerk team"
$$\begin{aligned} \operatorname{score}(q, d_1) &= 0.5 \cdot \cos(q,d_1) + 0.5 \cdot \operatorname{PR}(d_1) \\ \operatorname{score}(q, d_2) &= 0.5 \cdot \cos(q,d_2) + 0.5 \cdot \operatorname{PR}(d_2) \end{aligned}$$
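The equal-weight combination of cosine similarity and PageRank above can be sketched in a few lines of Python. The function name `combined_score` and the cosine/PageRank values for `d1` and `d2` are made up for illustration; only the 0.5 weights come from the notes.

```python
def combined_score(cosine, pagerank, w_cos=0.5, w_pr=0.5):
    """Weighted sum of a content-based score and a link-based score."""
    return w_cos * cosine + w_pr * pagerank


# Hypothetical (cosine, PageRank) values for two documents; not from the lecture.
docs = {
    "d1": (0.8, 0.1),
    "d2": (0.4, 0.7),
}

# Rank documents by combined score, highest first.
ranked = sorted(docs, key=lambda d: combined_score(*docs[d]), reverse=True)
print(ranked)
```

With these made-up numbers, `d2` outranks `d1` even though `d1` has the higher cosine score, because the PageRank term carries half the weight.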
Other factors we might care about:
- beauty (aesthetics)
- recency of update
- clicks in past day
- bytes
- load time
- query in title
Each of these could have its own parameter (a weight in the score), just as cosine and PageRank each have weight 0.5 above.
We could use machine learning to figure out the best values of these parameters.
Simple Example
Term proximity is the minimum query window size $\omega$: the query words all occur within a span of $\omega$ consecutive words.
- "penguin logo" with $\omega = 4$ could mean "penguin blah blah logo" appears on the page.
- Smaller window size is better
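One way the minimum window size might be computed is with a sliding window over the document's tokens. This is a minimal sketch, assuming the document is already tokenized and that matching is exact; the function name `min_window` is hypothetical and not from the lecture.

```python
def min_window(doc_tokens, query_terms):
    """Return the smallest number of consecutive tokens that contains every
    query term, or None if some query term never appears."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = {}   # query term -> occurrences inside the current window
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # While the window covers all query terms, record it and shrink from the left.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_tok = doc_tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    del counts[left_tok]
            left += 1
    return best


page = "the penguin blah blah logo is cute".split()
print(min_window(page, ["penguin", "logo"]))  # 4
```

For the "penguin blah blah logo" page the window is 4 words wide. If the two query words were adjacent, the window would be 2.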
| Example | Doc ID | Query | Cosine | $\omega$ | Judgement |
|---|---|---|---|---|---|
| $\phi_1$ | 37 | linux operating system | 0.032 | 3 | relevant |
| $\phi_2$ | 37 | penguin logo | 0.02 | 4 | nonrelevant |
| $\phi_3$ | 238 | operating system | 0.043 | 2 | relevant |
| $\phi_4$ | 238 | runtime environment | 0.004 | 2 | nonrelevant |
We can create a feature space over the cosine and $\omega$ features! With only these two features the space is two-dimensional; each additional feature adds a dimension.
In the high-dimensional feature space, find a classifier that separates relevant from nonrelevant examples (this needs training data).
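As a rough illustration, the four judged examples in the table can serve as training data for a linear relevant/nonrelevant classifier. The notes do not say which learning algorithm to use; the perceptron below is just one simple choice. Rescaling cosine by 100 and adding a bias feature are also assumptions made for this sketch.

```python
# Training examples from the table: (cosine, omega, label),
# where +1 = relevant and -1 = nonrelevant.
EXAMPLES = [
    (0.032, 3, +1),
    (0.020, 4, -1),
    (0.043, 2, +1),
    (0.004, 2, -1),
]


def features(cosine, omega):
    # Assumption: rescale cosine so both features have similar magnitude;
    # the trailing 1.0 acts as a bias term.
    return (cosine * 100.0, float(omega), 1.0)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, max_epochs=10_000):
    """Perceptron: nudge the weights toward each misclassified example."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for cosine, omega, label in examples:
            x = features(cosine, omega)
            if label * dot(w, x) <= 0:  # wrong side of (or on) the boundary
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w


def predict(w, cosine, omega):
    return +1 if dot(w, features(cosine, omega)) > 0 else -1


weights = train_perceptron(EXAMPLES)
for cosine, omega, label in EXAMPLES:
    print(cosine, omega, label, predict(weights, cosine, omega))
```

These four points happen to be linearly separable (a cosine threshold alone splits them), so the perceptron converges. Four examples are far too few to learn anything general; the point is only the shape of the pipeline: features in, judgement out.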
In 2008, Google admitted to using over 200 features. Now they're using well over 1000.
- is tilde in url? — FEATURE!
- how many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
- does it have a picture of a cat on the page? — FEATURE!
This is a high-level abstraction of boolean retrieval: each document is classified as relevant or nonrelevant. A high-level abstraction of ranked retrieval will be discussed next time.