CSCE 470 Lecture 9
Friday, September 13, 2013
Ranked Search Results
History
Early Search Engines
- Altavista
- Lycos
- Excite
- Infoseek
- Inktomi
Paid search ranking:
- Goto
- → Overture.com
- → Yahoo!
1998: Link-based ranking pioneered by Google
- Mind (and other search engines) = blown!
- Great user experience in search of a business model
- Meanwhile, Goto is making $1 billion per year.
- Google adds "ads"
Algorithmic search results formed by "web crawlers" or "spiders"
Bowtie structure of the web
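- In-pages link to the strongly connected component (SCC), but the SCC does not link back to them
- The strongly connected component sits in the center: every page in it can reach every other page by following links
- Out-pages are linked to by the SCC, but do not link back to it
- Tendrils (pages reachable from in-pages, or pages that link to out-pages, without passing through the SCC), tubes (paths from in-pages to out-pages that bypass the SCC), islands (pages disconnected from the rest)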
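The bowtie regions can be recovered from a link graph with two reachability searches. The sketch below is only an illustration: it assumes a `seed` page already known to lie in the core, and the toy graph `web` and all page names are made up. Pages reachable from the seed and able to reach it form the SCC; pages that can only reach it are in-pages; pages only reachable from it are out-pages; everything else (tendrils, tubes, islands) is lumped into `other`.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search: every node reachable from start (inclusive)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return the graph with every link direction flipped."""
    rev = {}
    for src, dsts in graph.items():
        rev.setdefault(src, [])
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev

def bowtie(graph, seed):
    """Split pages into SCC / in / out / other, relative to a seed in the core."""
    forward = reachable(graph, seed)             # pages the seed can reach
    backward = reachable(reverse(graph), seed)   # pages that can reach the seed
    scc = forward & backward
    nodes = set(graph) | {d for dsts in graph.values() for d in dsts}
    return {
        "scc": scc,
        "in": backward - scc,
        "out": forward - scc,
        "other": nodes - forward - backward,     # tendrils, tubes, islands
    }

# Hypothetical toy web: a -> b -> c -> a is the core.
web = {
    "in1": ["a", "tube", "tendril"],
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "out1"],
    "tube": ["out1"],
    "out1": [],
    "island1": ["island2"],
}

regions = bowtie(web, "a")
for name, pages in regions.items():
    print(name, sorted(pages))
```

Using a known seed avoids a full SCC algorithm (such as Tarjan's), at the cost of requiring that one core page be identified in advance.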
Statistical Properties of Text
Zipf's Law
The $i$th most frequent term has frequency proportional to $1/i$: $cf_i \propto \frac{1}{i}$
$cf_i$ is the collection frequency: the number of occurrences of the $i$th most frequent term in the collection
A few terms occur very often, but many terms are infrequent
Balance between diverse, complex words and few simple words.
Similar concepts:
- 80% of ___ come from 20% of ___
- Fat head, chunky middle, Long Tail (first quadrant of $y = 1/x$)
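A rough way to see Zipf's law is to rank terms by collection frequency and compare each count with the top count divided by the rank. The sketch below uses a made-up toy corpus, and the helper names `rank_frequency` and `zipf_prediction` are illustrative, not part of any library.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, collection frequency) tuples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, cf) for rank, (term, cf) in enumerate(ranked, start=1)]

def zipf_prediction(top_cf, rank):
    """Frequency predicted at a rank if frequency is proportional to 1/rank."""
    return top_cf / rank

# Hypothetical toy corpus, built so the counts fall off roughly like 1/rank.
tokens = ("the " * 12 + "of " * 6 + "search " * 4 +
          "engine " * 3 + "spider " * 2 + "tendril").split()

table = rank_frequency(tokens)
top_cf = table[0][2]
for rank, term, cf in table:
    print(rank, term, cf, round(zipf_prediction(top_cf, rank), 1))
```

On real text the fit is only approximate, and it is usually checked on a log-log plot of frequency against rank, where a power law appears as a straight line.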
Query Distribution
Queries have a power law distribution:
Few very frequent words, a large number of very rare words
Portion of adult queries is much lower than 1/3
Heaps' Law
The dictionary size $M$ (number of unique words in the index) is proportional to a power of the total word count $T$ of the corpus: $M = kT^b$
This model usually works best with $30 \le k \le 100$ and $b \approx 0.5$.
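As a small sketch of Heaps' law, assume its usual form $M = kT^b$, where $M$ is the dictionary size and $T$ the total number of tokens. The defaults `k=50` and `b=0.5` below are illustrative assumptions, not values fitted to any corpus. `vocabulary_growth` measures the actual number of distinct terms seen so far in a token stream, which is the curve the model tries to approximate.

```python
import math

def heaps_vocabulary(total_tokens, k=50.0, b=0.5):
    """Dictionary size predicted by M = k * T**b (k and b are assumed values)."""
    return k * total_tokens ** b

def vocabulary_growth(tokens):
    """Number of distinct terms seen after each token of the stream."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

# Predicted dictionary sizes for corpora of increasing length.
for total in (10_000, 1_000_000, 100_000_000):
    print(total, round(heaps_vocabulary(total)))

# Measured growth on a tiny made-up stream.
print(vocabulary_growth(["a", "b", "a", "c", "b", "d"]))
```

With $b$ near 0.5, multiplying the corpus size by 100 multiplies the predicted dictionary size by only about 10, so the dictionary keeps growing but much more slowly than the collection.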
Search optimization and its Dirty Little Secrets
There's a dark side to IR:
- search engine spamming,
- click fraud (bots),