CSCE 470 Lecture 31
Monday, November 11, 2013
Learning to Rank
In the beginning, there was Boolean retrieval: no ranking.
Ranked results:
- Jaccard coefficient (content-based)
- TF-IDF and cosine (content-based)
- Link analysis: PageRank and HITS (not content-based)
Example:
Query: "twerk team"
Other factors we might care about:
- beauty (aesthetics)
- recency of update
- clicks in past day
- document size (bytes)
- load time
- query in title
Each of these factors could have its own weight parameter, and we could use machine learning to figure out the best values for these parameters from training data.
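As a rough sketch of what "parameters" means here (the feature names and weight values below are invented for illustration, not taken from the lecture), a hand-tuned ranker might score a document as a weighted sum of its feature values; learning to rank replaces the hand-tuning by fitting these weights to training data:

```python
# Sketch of a linear scoring function: score = sum_i w_i * f_i(doc, query).
# Feature names and weight values are illustrative assumptions only.

def score(doc_features, weights):
    """Weighted sum of feature values for one (document, query) pair."""
    return sum(weights[name] * value for name, value in doc_features.items())

# Hand-picked weights; a learning-to-rank system would fit these
# from relevance judgments instead.
weights = {
    "cosine": 3.0,          # TF-IDF cosine similarity
    "recency": 0.5,         # how recently the page was updated
    "clicks_past_day": 0.1,
    "query_in_title": 2.0,
    "load_time": -0.2,      # slower pages get penalized
}

doc_features = {
    "cosine": 0.032,
    "recency": 0.8,
    "clicks_past_day": 12,
    "query_in_title": 1,
    "load_time": 0.4,
}

print(score(doc_features, weights))
```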
Simple Example
Term proximity is the minimum query window size: the size of the smallest window of consecutive words in the document that contains all of the query terms.
- For the query "penguin logo", this could mean that "penguin blah blah logo" appears somewhere on the page.
- A smaller window size is better: the query terms occur closer together. (A sketch of this computation follows the list.)
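Here is a rough sketch of how this feature might be computed, assuming a tokenized document and counting the window size inclusively (so "penguin blah blah logo" gives 4, matching the table below); this is one plausible implementation, not the lecture's exact definition:

```python
def min_window(doc_tokens, query_terms):
    """Size of the smallest span of doc_tokens containing every query term.

    Returns None if some query term never appears in the document.
    Window size is counted inclusively (an assumption), so
    "penguin blah blah logo" yields 4 for the query ["penguin", "logo"].
    """
    needed = set(query_terms)
    best = None
    for start, token in enumerate(doc_tokens):
        if token not in needed:
            continue  # a minimal window must start at a query term
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in needed:
                seen.add(doc_tokens[end])
            if seen == needed:           # window [start, end] covers all terms
                width = end - start + 1
                if best is None or width < best:
                    best = width
                break
    return best

print(min_window("the penguin blah blah logo page".split(),
                 ["penguin", "logo"]))  # -> 4
```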
Example training data:

| Doc ID | Query | Cosine score | Window (ω) | Judgment |
|---|---|---|---|---|
| 37 | linux operating system | 0.032 | 3 | relevant |
| 37 | penguin logo | 0.02 | 4 | nonrelevant |
| 238 | operating system | 0.043 | 2 | relevant |
| 238 | runtime environment | 0.004 | 2 | nonrelevant |
We can treat the cosine score, window size, and any other features as dimensions of a (potentially high-dimensional) feature space. In that feature space, we find a classifier that separates relevant from nonrelevant documents, which requires training data; a minimal sketch follows.
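This sketch assumes just the two features from the table above (cosine score and window size) and uses scikit-learn's logistic regression as the relevant/nonrelevant classifier; any linear classifier would illustrate the same idea:

```python
# Learn a relevant/nonrelevant classifier in the (cosine, window) feature
# space from the four training examples in the table above.
from sklearn.linear_model import LogisticRegression

# Features: [cosine score, minimum query window size]
X = [
    [0.032, 3],  # doc 37,  "linux operating system" -> relevant
    [0.020, 4],  # doc 37,  "penguin logo"           -> nonrelevant
    [0.043, 2],  # doc 238, "operating system"       -> relevant
    [0.004, 2],  # doc 238, "runtime environment"    -> nonrelevant
]
y = [1, 0, 1, 0]  # 1 = relevant, 0 = nonrelevant

clf = LogisticRegression()
clf.fit(X, y)

# The learned weights define a linear decision boundary in feature space.
print(clf.coef_, clf.intercept_)

# Classify a new (cosine, window) pair.
print(clf.predict([[0.03, 3]]))
```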
In 2008, Google admitted to using over 200 features. Now they're using well over 1000.
- Is there a tilde in the URL? — FEATURE!
- How many times has the article been tweeted in the last day/month/year/etc.? — FEATURE!
- Does it have a picture of a cat on the page? — FEATURE!
This is a high-level abstraction of Boolean retrieval (a binary relevant/nonrelevant decision). The corresponding abstraction of ranked retrieval will be discussed next time.