CSCE 470 Lecture 31
Monday, November 11, 2013
Learning to Rank
In the beginning, there was Boolean retrieval: no ranking.
Ranked results:
- Jaccard coefficient (content-based)
- TF-IDF and cosine (content-based)
- Link analysis: PageRank and HITS (not content-based)
Example:
Query: "twerk team"
Other factors we might care about:
- beauty (aesthetics)
- recency of update
- clicks in past day
- document size (bytes)
- load time
- query in title
Each of these factors could have its own weight parameter, and we could use machine learning to figure out the best values for these parameters from training data.
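As a rough sketch of what "parameters" means here (the feature names and weight values below are invented for illustration, not taken from the lecture), a hand-tuned ranker might score a document as a weighted sum of its feature values; learning to rank replaces the hand-tuning by fitting these weights to training data:

```python
# Sketch of a linear scoring function: score = sum_i w_i * f_i(doc, query).
# Feature names and weight values are illustrative assumptions only.

def score(doc_features, weights):
    """Weighted sum of feature values for one (document, query) pair."""
    return sum(weights[name] * value for name, value in doc_features.items())

# Hand-picked weights; a learning-to-rank system would fit these
# from relevance judgments instead.
weights = {
    "cosine": 3.0,          # TF-IDF cosine similarity
    "recency": 0.5,         # how recently the page was updated
    "clicks_past_day": 0.1,
    "query_in_title": 2.0,
    "load_time": -0.2,      # slower pages get penalized
}

doc_features = {
    "cosine": 0.032,
    "recency": 0.8,
    "clicks_past_day": 12,
    "query_in_title": 1,
    "load_time": 0.4,
}

print(score(doc_features, weights))
```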
Simple Example
Term proximity is the minimum query window size: the size of the smallest window of consecutive words in the document that contains all of the query terms.
- For the query "penguin logo", this could mean that "penguin blah blah logo" appears somewhere on the page.
- A smaller window size is better: the query terms occur closer together. (A sketch of this computation follows the list.)
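Here is a rough sketch of how this feature might be computed, assuming a tokenized document and counting the window size inclusively (so "penguin blah blah logo" gives 4, matching the table below); this is one plausible implementation, not the lecture's exact definition:

```python
def min_window(doc_tokens, query_terms):
    """Size of the smallest span of doc_tokens containing every query term.

    Returns None if some query term never appears in the document.
    Window size is counted inclusively (an assumption), so
    "penguin blah blah logo" yields 4 for the query ["penguin", "logo"].
    """
    needed = set(query_terms)
    best = None
    for start, token in enumerate(doc_tokens):
        if token not in needed:
            continue  # a minimal window must start at a query term
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in needed:
                seen.add(doc_tokens[end])
            if seen == needed:           # window [start, end] covers all terms
                width = end - start + 1
                if best is None or width < best:
                    best = width
                break
    return best

print(min_window("the penguin blah blah logo page".split(),
                 ["penguin", "logo"]))  # -> 4
```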
Example training data:

| Doc ID | Query | Cosine score | Window (ω) | Judgment |
|---|---|---|---|---|
| 37 | linux operating system | 0.032 | 3 | relevant |
| 37 | penguin logo | 0.02 | 4 | nonrelevant |
| 238 | operating system | 0.043 | 2 | relevant |
| 238 | runtime environment | 0.004 | 2 | nonrelevant |
We can treat the cosine score, window size, and any other features as dimensions of a (potentially high-dimensional) feature space. In that feature space, we find a classifier that separates relevant from nonrelevant documents, which requires training data; a minimal sketch follows.
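This sketch assumes just the two features from the table above (cosine score and window size) and uses scikit-learn's logistic regression as the relevant/nonrelevant classifier; any linear classifier would illustrate the same idea:

```python
# Learn a relevant/nonrelevant classifier in the (cosine, window) feature
# space from the four training examples in the table above.
from sklearn.linear_model import LogisticRegression

# Features: [cosine score, minimum query window size]
X = [
    [0.032, 3],  # doc 37,  "linux operating system" -> relevant
    [0.020, 4],  # doc 37,  "penguin logo"           -> nonrelevant
    [0.043, 2],  # doc 238, "operating system"       -> relevant
    [0.004, 2],  # doc 238, "runtime environment"    -> nonrelevant
]
y = [1, 0, 1, 0]  # 1 = relevant, 0 = nonrelevant

clf = LogisticRegression()
clf.fit(X, y)

# The learned weights define a linear decision boundary in feature space.
print(clf.coef_, clf.intercept_)

# Classify a new (cosine, window) pair.
print(clf.predict([[0.03, 3]]))
```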
In 2008, Google admitted to using over 200 features. Now they're using well over 1000.
- Is there a tilde in the URL? — FEATURE!
- How many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
- Does it have a picture of a cat on the page? — FEATURE!
This is a high-level abstraction of Boolean retrieval (a binary relevant/nonrelevant decision). The corresponding abstraction of ranked retrieval will be discussed next time.