CSCE 470 Lecture 31


Monday, November 11, 2013


Learning to Rank

In the beginning, there was boolean retrieval: no ranking.

Ranked results:

  • Jaccard (content-based)
  • TF-IDF and cosine (content-based)
  • Link analysis: PageRank and HITS (not content-based)


Example:

Query: "twerk team"

\begin{align}
\mathrm{score}(q, d_1) &= 0.5 \cdot \cos(q, d_1) + 0.5 \cdot \mathrm{PR}(d_1) \\
\mathrm{score}(q, d_2) &= 0.5 \cdot \cos(q, d_2) + 0.5 \cdot \mathrm{PR}(d_2)
\end{align}
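As a rough illustration, the equal-weight blend of cosine similarity and PageRank can be sketched in Python. The cosine and PageRank values below are made up for the example (the lecture gives none), and the names `docs` and `score` are hypothetical.

```python
# Hypothetical cosine and PageRank values; the lecture does not give any.
docs = {
    "d1": {"cos": 0.30, "pr": 0.10},
    "d2": {"cos": 0.20, "pr": 0.40},
}

def score(features, w_cos=0.5, w_pr=0.5):
    """Blend content similarity (cosine) with link-based importance (PageRank)."""
    return w_cos * features["cos"] + w_pr * features["pr"]

# Rank documents from highest to lowest blended score.
ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
print(ranking)
```

With these invented numbers, d2 outranks d1 even though its cosine is lower, because its PageRank is higher. The 0.5/0.5 weights are arbitrary here, which is what motivates learning them.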

Other factors we might care about:

  • beauty (aesthetics)
  • recency of update
  • clicks in past day
  • bytes
  • load time
  • query in title

Each of these could have its own parameter:

\[ \mathrm{score}(q, d_1) = \sum_i \alpha_i \, \mathrm{param}_i(q, d_1) \]
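A minimal sketch of this weighted sum, assuming each feature is already a number. The feature names, weights, and values are hypothetical and chosen only for illustration.

```python
def linear_score(alphas, features):
    """Weighted sum of feature values: sum_i alpha_i * param_i(q, d)."""
    return sum(alphas[name] * value for name, value in features.items())

# Hypothetical weights (alpha_i) and feature values (param_i) for one (q, d) pair.
alphas = {"cosine": 0.4, "pagerank": 0.3, "recency": 0.2, "query_in_title": 0.1}
features_d1 = {"cosine": 0.5, "pagerank": 0.2, "recency": 1.0, "query_in_title": 1.0}

s = linear_score(alphas, features_d1)
print(s)
```

Adding a new factor then only means adding one more entry to each dictionary. Choosing the alphas by hand quickly becomes impractical as the number of features grows.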

We could use machine learning to figure out the best value of these parameters


Simple Example

Term proximity is the minimum query window size ω: the size, in words, of the smallest window of text on the page that contains all of the query words.

  • "penguin logo" with ω = 4 could mean "penguin blah blah logo" appears on the page.
  • Smaller window size is better
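
One way to compute the minimum window size is a sliding window over the page's tokens. This is a sketch, not necessarily how the lecture intends it to be computed. The function name `min_window` and the sample page are assumptions, and tokenization is just whitespace splitting.

```python
from collections import Counter

def min_window(tokens, query_terms):
    """Size (in words) of the smallest span of tokens containing every
    query term, or None if some query term never appears."""
    needed = set(query_terms)
    if not needed:
        return 0
    counts = Counter()   # occurrences of each query term inside the window
    covered = 0          # number of distinct query terms inside the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # Shrink from the left while the window still covers all terms.
        while covered == len(needed):
            size = right - left + 1
            if best is None or size < best:
                best = size
            drop = tokens[left]
            if drop in needed:
                counts[drop] -= 1
                if counts[drop] == 0:
                    covered -= 1
            left += 1
    return best

# Hypothetical page text matching the "penguin blah blah logo" example.
page = "the penguin blah blah logo is a penguin".split()
omega = min_window(page, ["penguin", "logo"])
print(omega)  # 4
```

Each token enters and leaves the window at most once, so the scan is linear in the page length.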

Example  Doc ID  Query                   Cosine  ω  Judgement
φ₁       37      linux operating system  0.032   3  relevant
φ₂       37      penguin logo            0.02    4  nonrelevant
φ₃       238     operating system        0.043   2  relevant
φ₄       238     runtime environment     0.004   2  nonrelevant

We can treat each example as a point in a feature space over the cosine and ω features (two dimensions here; every additional feature adds a dimension)!

In the high-dimensional feature space, find a classifier that separates relevant from nonrelevant examples (this requires training data).
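
As a toy illustration, a perceptron can be trained on the four (cosine, ω, judgement) examples from the table. The lecture does not name a particular learning algorithm, so the perceptron is a swapped-in choice for this sketch. Rescaling cosine by 100 is an assumption made so that both features have similar magnitude.

```python
# Training examples from the table: (cosine, omega, judgement).
# Labels: +1 = relevant, -1 = nonrelevant.
examples = [
    (0.032, 3, +1),
    (0.020, 4, -1),
    (0.043, 2, +1),
    (0.004, 2, -1),
]

def features(cosine, omega):
    # Rescaling cosine by 100 is an assumption; the trailing 1.0 is a bias input.
    return (100.0 * cosine, float(omega), 1.0)

def predict(w, x):
    """Classify as relevant (+1) if the weighted sum is positive, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(data, max_epochs=10000):
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for cosine, omega, label in data:
            x = features(cosine, omega)
            if predict(w, x) != label:
                # Move the weights toward the correct side for this example.
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w

w = train_perceptron(examples)
predictions = [predict(w, features(c, o)) for c, o, _ in examples]
print(w, predictions)
```

On these four examples the learned weight on cosine comes out positive and the weight on ω negative, which agrees with the idea that a smaller window is better. Four points are far too few to generalize from; a real system would need much more training data.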

In 2008, Google admitted to using over 200 features. Now they're using well over 1000.

  • is tilde in url? — FEATURE!
  • how many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
  • does it have a picture of a cat on the page? — FEATURE!

This relevant/nonrelevant classification is a high-level abstraction of boolean retrieval. The corresponding high-level abstraction of ranked retrieval will be discussed next time.