CSCE 470 Lecture 10

From Notes
Jump to navigation Jump to search

« previous | Monday, September 16, 2013 | next »


Link Analysis

Number 2 in our "most important things" list (first is TF-IDF vector space)

  1. Web as a graph: pages are nodes, links are edges.
  2. Anchor tesxt
  3. Google Bomb
  4. Page Rank


Anchor Text

Two assumptions about anchor text:

  1. A link is a quality signal (e.g. an endorsement)
  2. Anchor text actually describes target page.

These assumptions have been violated over and over, but back in the day, it wasn't a problem.

Anchor text would be counted in index with thinking that other people can describe website better than website can describe itself.

This resulted in Google Bombs, so Anchor text is not as heavily used any more.

Page Rank

  1. First iteration was called "backrub"; used number of incoming links for page ranking (easy to spam)
  2. (recursively) takes into account credibility of incoming links.


Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle PR(p) = \alpha \cdot PR(i) + \frac{1 + \alpha}{n} \,\!}
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} is number of pages
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 0 \le \alpha \le 1} is Credibility lending parameter. Suppose Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha = 0.7} for this case

Example:

CSCE 470 Pagerank Example Web Graph.svg

Method 1

Solve equations simultaneously for page ranks:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} PR(A) &= \alpha \, PR(C) + \frac{1-\alpha}{n} \\ PR(B) &= \alpha \, PR(A) + \frac{1-\alpha}{n} \\ PR(C) &= \alpha \, PR(B) + \frac{1-\alpha}{n} \\ PR(D) &= \alpha \, \frac{PR(B)}{2} + \frac{1-\alpha}{n} \end{align}}


Method 2

"random surfer model": Suppose we allow a web surfer to randomly surf the web for all eternity.

Markov Chain

Transition Probability Matrix: "if I'm at (row), where do I go next?"

TPM for A, B, C, D above:

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}}

Markov Chains must be ergodic in order to converge to give us a page rank. This matrix is not ergodic because of the last row. We'll discuss a "hack" next time...

Kleinberg

He is the "rebel king" (anagram)

  • West-coast (Stanford : Google) is working on PageRank
  • East-coest (Kleinberg) is working on his own ranking algorithm