CSCE 470 Lecture 10
« previous | Monday, September 16, 2013 | next »
Link Analysis
Number 2 in our "most important things" list (first is TF-IDF vector space)
- Web as a graph: pages are nodes, links are edges.
- Anchor tesxt
- Google Bomb
- Page Rank
Anchor Text
Two assumptions about anchor text:
- A link is a quality signal (e.g. an endorsement)
- Anchor text actually describes target page.
These assumptions have been violated over and over, but back in the day, it wasn't a problem.
Anchor text would be counted in index with thinking that other people can describe website better than website can describe itself.
This resulted in Google Bombs, so Anchor text is not as heavily used any more.
Page Rank
- First iteration was called "backrub"; used number of incoming links for page ranking (easy to spam)
- (recursively) takes into account credibility of incoming links.
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} is number of pages
- Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 0 \le \alpha \le 1} is Credibility lending parameter. Suppose Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha = 0.7} for this case
Example:
Method 1
Solve equations simultaneously for page ranks:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} PR(A) &= \alpha \, PR(C) + \frac{1-\alpha}{n} \\ PR(B) &= \alpha \, PR(A) + \frac{1-\alpha}{n} \\ PR(C) &= \alpha \, PR(B) + \frac{1-\alpha}{n} \\ PR(D) &= \alpha \, \frac{PR(B)}{2} + \frac{1-\alpha}{n} \end{align}}
Method 2
"random surfer model": Suppose we allow a web surfer to randomly surf the web for all eternity.
Markov Chain
Transition Probability Matrix: "if I'm at (row), where do I go next?"
TPM for A, B, C, D above:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}}
Markov Chains must be ergodic in order to converge to give us a page rank. This matrix is not ergodic because of the last row. We'll discuss a "hack" next time...
Kleinberg
He is the "rebel king" (anagram)
- West-coast (Stanford : Google) is working on PageRank
- East-coest (Kleinberg) is working on his own ranking algorithm