CSCE 470 Lecture 31


Monday, November 11, 2013


Learning to Rank

In the beginning, there was Boolean retrieval: no ranking.

Ranked results:

  • Jaccard coefficient (content-based)
  • TF-IDF and cosine similarity (content-based; see the sketch after this list)
  • Link analysis: PageRank and HITS (not content-based)
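As a refresher on the content-based side, here is a minimal sketch of TF-IDF cosine scoring. The tiny corpus, the log-TF weighting, and the document IDs are made up for illustration; they are not from the lecture.

```python
import math
from collections import Counter

# Toy corpus; documents are made up for illustration.
docs = {
    37: "linux is an open source operating system",
    238: "the java runtime environment runs on any operating system",
    1784: "the linux mascot is a penguin named tux",
}

# Document frequency of each term across the corpus
df = Counter(t for text in docs.values() for t in set(text.split()))
n_docs = len(docs)

def tfidf_vector(text):
    """Sparse log-TF x IDF vector for a piece of text."""
    counts = Counter(text.split())
    return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
            for t, c in counts.items() if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf_vector("operating system")
for doc_id, text in docs.items():
    print(doc_id, round(cosine(q, tfidf_vector(text)), 3))
```

Documents sharing query terms get a positive cosine score; documents with no overlap score 0.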


Example:

Query: "twerk team"

Other factors we might care about:

  • beauty (aesthetics)
  • recency of update
  • clicks in past day
  • bytes
  • load time
  • query in title

Each of these factors could have its own parameter (weight) in the document's overall score, e.g. a weighted sum of the factors above (see the sketch below).

We could use machine learning to figure out the best values of these parameters.
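A minimal sketch of that idea, assuming the score is a linear combination of the factors listed above. The feature names and weight values are invented placeholders; learning to rank means choosing the weights automatically from training data instead of by hand.

```python
# Each ranking factor gets its own parameter (weight); the document's score
# is a weighted sum of its feature values. Feature names and weights below
# are made up for illustration.

def score(features, weights):
    """Linear scoring function: sum over features of weight * value."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values for one (document, query) pair
features = {
    "cosine":            0.32,   # TF-IDF cosine similarity
    "days_since_update": 12.0,   # recency of update
    "clicks_past_day":   140.0,  # clicks in the past day
    "load_time_ms":      350.0,  # page load time
    "query_in_title":    1.0,    # 1 if the query appears in the title, else 0
}

# Hand-set placeholder weights (a negative weight means the factor hurts the score)
weights = {
    "cosine":            5.0,
    "days_since_update": -0.05,
    "clicks_past_day":   0.01,
    "load_time_ms":      -0.001,
    "query_in_title":    2.0,
}

print(score(features, weights))  # a single relevance score used for ranking
```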


Simple Example

Term proximity is the minimum query window size: the smallest span of consecutive words in the document that contains all of the query terms.

  • For the query "penguin logo", a window of size 4 could mean "penguin blah blah logo" appears on the page.
  • A smaller window size is better (one way to compute it is sketched below).
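One way to compute the minimum window is to scan the positions of query terms in the document. This is a rough sketch, not necessarily how any real engine implements it.

```python
def min_window(doc_tokens, query_terms):
    """Size of the smallest span of consecutive tokens in doc_tokens that
    contains every term in query_terms; None if some term is absent."""
    query = set(query_terms)
    if not query.issubset(doc_tokens):
        return None
    best = None
    for start, tok in enumerate(doc_tokens):
        if tok not in query:
            continue  # a minimal window must start at a query term
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in query:
                seen.add(doc_tokens[end])
            if seen == query:
                size = end - start + 1
                best = size if best is None else min(best, size)
                break
    return best

# "penguin blah blah logo" gives a window of size 4 for the query "penguin logo"
print(min_window("penguin blah blah logo".split(), ["penguin", "logo"]))  # 4
```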
Example (each row is a document-query pair with its feature values and a human judgement):

  Doc ID   Query                    Cosine   Window (ω)   Judgement
  37       linux operating system   0.032    3            relevant
  37       penguin logo             0.02     4            nonrelevant
  238      operating system         0.043    2            relevant
  238      runtime environment      0.004    2            nonrelevant

We can plot each document-query pair as a point in this feature space (cosine score vs. window size), and more generally in a high-dimensional space over many features!

Using that high-dimensional feature space and training data with relevance judgements, we can learn a classifier that separates relevant from nonrelevant points (see the sketch below).
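A sketch of the classification view, using the four judged examples from the table above as training data. The choice of scikit-learn and logistic regression is mine; the lecture only calls for some relevant/nonrelevant classifier learned from judged examples.

```python
# Learn a relevant/nonrelevant classifier over the (cosine, window) feature space.
from sklearn.linear_model import LogisticRegression

# (cosine score, minimum window size) for each (doc, query) pair in the table
X = [
    [0.032, 3],   # doc 37,  "linux operating system"  -> relevant
    [0.020, 4],   # doc 37,  "penguin logo"            -> nonrelevant
    [0.043, 2],   # doc 238, "operating system"        -> relevant
    [0.004, 2],   # doc 238, "runtime environment"     -> nonrelevant
]
y = [1, 0, 1, 0]  # 1 = relevant, 0 = nonrelevant

clf = LogisticRegression().fit(X, y)

# Classify a new (doc, query) pair with cosine 0.05 and window size 2
print(clf.predict([[0.05, 2]]))
```

With only four training points the learned boundary is not meaningful; the point is the plumbing: features plus judgements in, a relevance decision out.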

In 2008, Google admitted to using over 200 features. Now they're using well over 1000.

  • is tilde in url? — FEATURE!
  • how many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
  • does it have a picture of a cat on the page? — FEATURE!

This classification approach is a high-level abstraction of Boolean retrieval: the output is a binary relevant/nonrelevant decision. A corresponding abstraction of ranked retrieval will be discussed next time.