CSCE 470 Lecture 10

From Notes
Jump to navigation Jump to search

« previous | Monday, September 16, 2013 | next »


Link Analysis

Number 2 in our "most important things" list (first is TF-IDF vector space)

  1. Web as a graph: pages are nodes, links are edges.
  2. Anchor tesxt
  3. Google Bomb
  4. Page Rank


Anchor Text

Two assumptions about anchor text:

  1. A link is a quality signal (e.g. an endorsement)
  2. Anchor text actually describes target page.

These assumptions have been violated over and over, but back in the day, it wasn't a problem.

Anchor text would be counted in index with thinking that other people can describe website better than website can describe itself.

This resulted in Google Bombs, so Anchor text is not as heavily used any more.

Page Rank

  1. First iteration was called "backrub"; used number of incoming links for page ranking (easy to spam)
  2. (recursively) takes into account credibility of incoming links.


  • is number of pages
  • is Credibility lending parameter. Suppose for this case

Example:

CSCE 470 Pagerank Example Web Graph.svg

Method 1

Solve equations simultaneously for page ranks:


Method 2

"random surfer model": Suppose we allow a web surfer to randomly surf the web for all eternity.

Markov Chain

Transition Probability Matrix: "if I'm at (row), where do I go next?"

TPM for A, B, C, D above:

Markov Chains must be ergodic in order to converge to give us a page rank. This matrix is not ergodic because of the last row. We'll discuss a "hack" next time...

Kleinberg

He is the "rebel king" (anagram)

  • West-coast (Stanford : Google) is working on PageRank
  • East-coest (Kleinberg) is working on his own ranking algorithm