CSCE 470 Lecture 9

Friday, September 13, 2013


Ranked Search Results

History

Early Search Engines

  • Altavista
  • Lycos
  • Excite
  • Infoseek
  • Inktomi

Paid search ranking:

  • Goto → Overture.com → Yahoo! (Goto was renamed Overture, which Yahoo! later acquired)

1998: Link-based ranking pioneered by Google

  • Mind (and other search engines) = blown!
  • Great user experience in search of a business model
  • Meanwhile, Goto is making $1 billion per year.
  • Google adds "ads"

Algorithmic search results are built from pages gathered by "web crawlers" or "spiders"


Bowtie structure of the web

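  • In-pages link to the strongly connected component (SCC), but the SCC doesn't link back to them
  • Strongly connected component in the center: every page can reach every other page by following links
  • Out-pages are linked to by the SCC, but don't link back to it
  • Tendrils (pages reached from in-pages, or pages linking to out-pages, without passing through the SCC), tubes (paths from in-pages to out-pages that bypass the SCC), islands (pages disconnected from the rest)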
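
The bowtie regions can be recovered from a link graph with two reachability searches. The following is a minimal sketch, not the method used in the lecture: the function names `reachable` and `bowtie` are made up for illustration, and it assumes the caller already knows one page (`core_node`) that lies in the central SCC. Tendrils, tubes, and islands are lumped together as "other".

```python
from collections import deque

def reachable(start, adjacency):
    """Breadth-first search: every node reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bowtie(edges, core_node):
    """Split pages into SCC / in / out / other, given one page in the SCC."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)    # links followed forward
        backward.setdefault(dst, []).append(src)   # links followed in reverse
        nodes.update((src, dst))
    descendants = reachable(core_node, forward)    # pages the core can reach
    ancestors = reachable(core_node, backward)     # pages that can reach the core
    scc = descendants & ancestors
    return {
        "scc": scc,
        "in": ancestors - scc,
        "out": descendants - scc,
        "other": nodes - ancestors - descendants,  # tendrils, tubes, islands
    }

# Toy web: a -> (b <-> c) -> d, plus a disconnected pair e -> f.
toy_edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d"), ("e", "f")]
regions = bowtie(toy_edges, core_node="b")
print(regions)
```

On the toy graph, `a` lands in "in", `b` and `c` form the SCC, `d` lands in "out", and the disconnected pair ends up in "other".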


Statistical Properties of Text

Zipf's Law

The $i$th most frequent term has frequency proportional to $1/i$:

$\mathrm{cf}_i \propto \frac{1}{i}$

$\mathrm{cf}_i$ is the collection frequency: the number of occurrences of the $i$th most frequent term in the collection

A few terms occur very often, but many terms are infrequent

Balance between diverse, complex words and few simple words.
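
One way to check Zipf's law on a collection is to rank terms by collection frequency and look at rank × frequency, which should stay roughly constant if frequency is proportional to 1/rank. The sketch below is illustrative only: the helper names `rank_frequency` and `zipf_products` are hypothetical, and the toy token list is constructed by hand so that the law holds exactly.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, collection_frequency) triples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, cf) for rank, (term, cf) in enumerate(ranked, start=1)]

def zipf_products(tokens):
    """rank * frequency for each term; roughly constant under Zipf's law."""
    return [rank * cf for rank, _, cf in rank_frequency(tokens)]

# Hand-built toy collection where frequency is exactly 12 / rank.
toy_tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for rank, term, cf in rank_frequency(toy_tokens):
    print(rank, term, cf, rank * cf)
```

On real text the products drift rather than staying fixed, so a log-log plot of frequency against rank is the more common diagnostic.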

Similar concepts:


Query Distribution

Queries have a power law distribution:

Few very frequent words, a large number of very rare words

Portion of adult queries is much lower than 1/3

Heaps' Law

The dictionary size $M$ (number of unique words in the index) is proportional to a power of the total word count $T$ of the corpus:

$M = k T^b$

This model usually works best with $30 \le k \le 100$ and $b \approx 0.5$.
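
A small sketch of Heaps' law in code, writing M for the dictionary size and T for the total word count, with constants k and b (these symbol names, the function names, and the default values k = 50 and b = 0.5 are assumptions chosen for illustration, not fitted to any corpus). The second function measures the actual vocabulary growth of a token stream, which is the curve one would compare against the estimate.

```python
def heaps_estimate(total_tokens, k=50.0, b=0.5):
    """Predicted dictionary size M = k * T**b (k and b are illustrative)."""
    return k * total_tokens ** b

def vocabulary_growth(tokens):
    """Number of distinct terms seen after each token of the stream."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

# With the assumed k and b, a million-token corpus gives 50 * 1000 terms.
print(heaps_estimate(1_000_000))

# Observed growth on a tiny stream: new terms arrive less and less often.
print(vocabulary_growth("a b a c b d".split()))
```

In practice k and b would be fitted to the observed growth curve of the corpus at hand rather than fixed in advance.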

Search optimization and its Dirty Little Secrets

There's a dark side to IR:

  • search engine spamming,
  • click fraud (bots),