CSCE 470 Lecture 12

From Notes
Jump to navigation Jump to search

« previous | Monday, September 23, 2013 | next »


Quiz Review

Would stemming affect precision, recall, both, or neither?
Improve recall, but would negatively affect precision.
is tf or idf more important in tweet searching/
idf: tweets are so short that they are unlikely to have the same word multiple times

Topic-Sensitive Pagerank (TSPR)

Given a graph of the web and two topics—"miley cyrus" and "other"

Page nodes are labeled with topics, so each page has Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} scores—one for each topic.

We run our random surfer analysis thing once for each topic:

  • instead of randomly jumping to any page, randomly jumps to only a page labeled with the surf topic

Thus if we have 4 pages A, B, C, and D, where only A and B match our topic, our teleport matrix would be as follows:

Thus our markov chain would converge to a pagerank weighted toward our topic

Note: our "hack" row in the link matrix is identical to the rows in our teleport matrix.


Hubs and Authorities or Hyperlink Induced Topic Search (HITS)

Independently developed on the east coast at Cornell University by Kleinberg

Authority
page that directly satisfies my information need
Hub
Page that aggregates links to pages

Every page has two scores:

  • an authority score , and
  • a hub score Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle H(p)} .

Good authorities are pointed to by good hubs, and good hubs point to good authorities

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \begin{align} A(p)_t &= \sum_{q : q \to p} H(q)_{t-1} \\ H(p)_t &= \sum_{p : p \to q} A(q)_{t-1} \end{align}}

Similar to our pagerank algorithm, we start out with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle H(p) = 1} and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A(p) = 1} for all pages Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p} .

Next we proceed to iteratively update all hub and authority scores