CSCE 470 Lecture 31

Monday, November 11, 2013


Learning to Rank

In the beginning, there was boolean retrieval: no ranking.

Ranked results:

  • Jaccard (content-based)
  • TF-IDF and cosine (content-based)
  • Link analysis: PageRank and HITS (not content-based)


Example:

Query: "twerk team"

$$\begin{aligned}
\mathrm{score}(q, d_1) &= 0.5 \cdot \cos(q, d_1) + 0.5 \cdot \mathrm{PR}(d_1) \\
\mathrm{score}(q, d_2) &= 0.5 \cdot \cos(q, d_2) + 0.5 \cdot \mathrm{PR}(d_2)
\end{aligned}$$
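The equal-weight combination above can be sketched in Python. The cosine and PageRank values here are made-up placeholders for illustration, not figures from the lecture, and `combined_score` is a hypothetical helper name.

```python
def combined_score(cosine: float, pagerank: float, weight: float = 0.5) -> float:
    """Blend a content-based score with a link-based score.

    weight = 0.5 reproduces the equal weighting used in the formula above.
    """
    return weight * cosine + (1.0 - weight) * pagerank


# Hypothetical (cosine, PageRank) values for two documents; not real data.
docs = {"d1": (0.30, 0.10), "d2": (0.20, 0.40)}

scores = {doc: combined_score(cos, pr) for doc, (cos, pr) in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these placeholder numbers, d2 has the lower cosine score but still ranks first because of its higher PageRank, which is the point of mixing content-based and link-based signals.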

Other factors we might care about:

  • beauty (aesthetics)
  • recency of update
  • clicks in past day
  • bytes
  • load time
  • query in title

Each of these factors could have its own weight $\alpha_i$:

$$\mathrm{score}(q, d_1) = \sum_i \alpha_i \, \mathrm{param}_i(q, d_1)$$

We could use machine learning to figure out the best values of these weights.
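As a rough sketch, the weighted sum above generalizes the two-term score to any number of features. The feature values and weights below are hypothetical and hand-picked only to show the arithmetic. A learning method would choose the weights from training data instead.

```python
from typing import Sequence


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Return sum_i alpha_i * param_i(q, d) for one (query, document) pair."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(a * f for a, f in zip(weights, features))


# Hypothetical feature values for one (q, d) pair, in this order:
# cosine, PageRank, recency of update, query in title (1 = yes, 0 = no).
features = [0.30, 0.10, 0.80, 1.0]
# Hypothetical hand-picked weights; learning would choose these instead.
weights = [0.4, 0.3, 0.1, 0.2]

print(linear_score(weights, features))
```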


Simple Example

Term proximity: $\omega$ is the minimum query window size, the length in words of the smallest window in the document that contains all of the query words.

  • "penguin logo" with $\omega = 4$ could mean "penguin blah blah logo" appears on the page.
  • Smaller window size is better
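
One way to compute the minimum query window size is a sliding window over the document's words. This is a minimal sketch, assuming the document is already tokenized into a list of words and that terms match exactly. `min_query_window` is a hypothetical name, not something from the lecture.

```python
from collections import Counter
from typing import Optional, Sequence


def min_query_window(doc_words: Sequence[str], query_words: Sequence[str]) -> Optional[int]:
    """Return the size (in words) of the smallest window of doc_words that
    contains every distinct query word, or None if some query word is missing."""
    needed = set(query_words)
    if not needed:
        return 0
    counts: Counter = Counter()  # occurrences of query words inside the window
    covered = 0                  # distinct query words currently in the window
    best = None
    left = 0
    for right, word in enumerate(doc_words):
        if word in needed:
            counts[word] += 1
            if counts[word] == 1:
                covered += 1
        # Shrink from the left while the window still covers every query word.
        while covered == len(needed):
            size = right - left + 1
            if best is None or size < best:
                best = size
            dropped = doc_words[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return best


print(min_query_window("penguin blah blah logo".split(), ["penguin", "logo"]))  # prints 4
```

Each word enters and leaves the window at most once, so the scan is linear in the document length.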

Example   Doc ID  Query                   Cosine  $\omega$  Judgement
$\phi_1$  37      linux operating system  0.032   3         relevant
$\phi_2$  37      penguin logo            0.02    4         nonrelevant
$\phi_3$  238     operating system        0.043   2         relevant
$\phi_4$  238     runtime environment     0.004   2         nonrelevant

We can create a feature space over the cosine and $\omega$ features! With two features the space is two-dimensional; every additional feature adds a dimension.

In the resulting high-dimensional feature space, find a classifier that separates relevant from nonrelevant examples (this needs training data).
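
To make the classifier idea concrete, here is a minimal sketch that trains a perceptron on the four examples from the table above. The perceptron is chosen only for illustration; the lecture does not name a specific learning algorithm. Scaling cosine by 100 is likewise an assumption, made so that both features have comparable magnitude.

```python
from typing import List, Sequence, Tuple

# Training examples from the table above: (cosine, omega, judgement).
examples = [
    (0.032, 3, "relevant"),
    (0.020, 4, "nonrelevant"),
    (0.043, 2, "relevant"),
    (0.004, 2, "nonrelevant"),
]


def features(cosine: float, omega: int) -> List[float]:
    # Assumption: scale cosine by 100 so both features have similar magnitude.
    # The trailing 1.0 is a constant bias input.
    return [100.0 * cosine, float(omega), 1.0]


def predict(weights: Sequence[float], x: Sequence[float]) -> str:
    activation = sum(w * xi for w, xi in zip(weights, x))
    return "relevant" if activation > 0 else "nonrelevant"


def train_perceptron(data: Sequence[Tuple[float, int, str]], max_epochs: int = 1000) -> List[float]:
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for cosine, omega, label in data:
            x = features(cosine, omega)
            if predict(weights, x) != label:
                # Move the weights toward (or away from) the misclassified point.
                sign = 1.0 if label == "relevant" else -1.0
                weights = [w + sign * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights


weights = train_perceptron(examples)
for cosine, omega, label in examples:
    print(cosine, omega, label, predict(weights, features(cosine, omega)))
```

Four examples are far too few to learn anything general. The sketch only shows the shape of the approach: judged examples go in, a decision boundary in feature space comes out.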

In 2008, Google admitted to using over 200 features. Now they're using well over 1000.

  • is tilde in url? — FEATURE!
  • how many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
  • does it have a picture of a cat on the page? — FEATURE!
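
Each signal in the list above becomes one coordinate of the feature vector. A minimal sketch, assuming a hypothetical `Page` record whose fields are already populated (how to count tweets or detect cat pictures is outside its scope):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical record; a real system would populate these from a crawl.
    url: str
    tweets_last_day: int
    has_cat_picture: bool


def feature_vector(page: Page) -> List[float]:
    """Turn page signals into numeric features, one feature per signal."""
    return [
        1.0 if "~" in page.url else 0.0,      # is a tilde in the URL?
        float(page.tweets_last_day),          # tweets in the last day
        1.0 if page.has_cat_picture else 0.0, # cat picture on the page?
    ]


page = Page(url="http://example.edu/~student/notes.html", tweets_last_day=12, has_cat_picture=False)
print(feature_vector(page))  # prints [1.0, 12.0, 0.0]
```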

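This relevant/nonrelevant classifier is a high-level abstraction of boolean retrieval: each document is either in the result set or not. The corresponding high-level abstraction of ranked retrieval will be discussed next time.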