CSCE 470 Lecture 12
Monday, September 23, 2013
Quiz Review
- tf-idf / cos vector question
- why does tf matter?
- why does idf matter?
- no link analysis
- terms (know definitions, advantages/disadvantages)
- NDCG
- precision
- recall
- zipf's law
- heaps' law
- tokenization
- normalization
- stemming
- lemmatization
- statistical properties of text
- Would stemming affect precision, recall, both, or neither?
- Both: stemming would improve recall, but would negatively affect precision.
- Is tf or idf more important in tweet searching?
- idf: tweets are so short that they are unlikely to have the same word multiple times
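To make the tweet answer concrete, here is a minimal Python sketch. The four toy tweets, the helper names, and the choice of log base 10 for idf are all assumptions made for illustration, not something from the lecture. It shows that every term frequency in short texts tends to be 1, so idf is what separates a rare term from a common one.

```python
import math
from collections import Counter

# Hypothetical toy "tweets", invented for this example.
tweets = [
    "miley cyrus new video",
    "new phone today",
    "video of the game today",
    "got a new haircut",
]

def term_frequencies(doc):
    """Raw term counts for one whitespace-tokenized document."""
    return Counter(doc.split())

def inverse_document_frequency(term, docs):
    """idf = log10(N / df); assumes the term occurs in at least one doc."""
    df = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / df)

# In these short texts no term repeats, so tf cannot tell terms apart.
max_tf = max(max(term_frequencies(t).values()) for t in tweets)

# idf still separates a common term ("new") from a rare one ("miley").
idf_new = inverse_document_frequency("new", tweets)
idf_miley = inverse_document_frequency("miley", tweets)

print("largest tf in any tweet:", max_tf)
print("idf(new) =", idf_new, " idf(miley) =", idf_miley)
```

With longer documents the tf values would spread out and start to matter again.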
Topic-Sensitive Pagerank (TSPR)
Given a graph of the web and two topics—"miley cyrus" and "other"
Page nodes are labeled with topics, so each page has scores—one for each topic.
We run our random surfer analysis once for each topic:
- instead of teleporting to any random page, the surfer teleports only to a page labeled with that topic
Thus if we have 4 pages A, B, C, and D, where only A and B match our topic, every row of our teleport matrix would be (1/2, 1/2, 0, 0): the surfer teleports to A or B with equal probability and never to C or D.
Our Markov chain would then converge to a PageRank weighted toward our topic
Note: our "hack" row in the link matrix is identical to the rows in our teleport matrix.
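The topic-restricted teleport can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-page link graph, the teleport probability `alpha=0.15`, the iteration count, and the function name are all hypothetical. Dead-end pages are handled by always teleporting, which is one reading of the note about the "hack" row.

```python
def topic_sensitive_pagerank(links, topic_pages, alpha=0.15, iterations=100):
    """Power iteration where teleports land only on topic_pages.

    links: dict mapping each page to the list of pages it links to.
    alpha: teleport probability (an assumed value, not from the notes).
    """
    pages = sorted(links)
    # Teleport distribution: uniform over topic pages, zero elsewhere.
    teleport = {
        p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
        for p in pages
    }
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if out:
                # Follow an out-link with probability 1 - alpha ...
                for q in out:
                    new_rank[q] += (1 - alpha) * rank[p] / len(out)
                # ... or teleport to a topic page with probability alpha.
                for q in pages:
                    new_rank[q] += alpha * rank[p] * teleport[q]
            else:
                # Dead end: the row is replaced by the teleport row.
                for q in pages:
                    new_rank[q] += rank[p] * teleport[q]
        rank = new_rank
    return rank

# Hypothetical graph: D is a dead end; only A and B match the topic.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
topic_rank = topic_sensitive_pagerank(links, topic_pages={"A", "B"})
uniform_rank = topic_sensitive_pagerank(links, topic_pages=set(links))

print("topic-sensitive:", topic_rank)
print("ordinary:       ", uniform_rank)
```

Passing every page as `topic_pages` recovers ordinary PageRank, which makes it easy to compare how much the topic bias shifts rank toward A and B.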
Hubs and Authorities or Hyperlink Induced Topic Search (HITS)
Independently developed on the east coast at Cornell University by Kleinberg
- Authority
- page that directly satisfies my information need
- Hub
- Page that aggregates links to pages
Every page has two scores:
- an authority score $a(p)$, and
- a hub score $h(p)$.
Good authorities are pointed to by good hubs, and good hubs point to good authorities
Similar to our PageRank algorithm, we start out with $a(p) = 1$ and $h(p) = 1$ for all pages $p$.
Next we proceed to iteratively update all hub and authority scores
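A minimal Python sketch of one common form of this iteration follows. The update order (authorities first, then hubs from the fresh authority scores), the normalization so each score set sums to 1, the iteration count, and the tiny example graph are all assumptions for illustration. The lecture's exact update rule may differ.

```python
def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    links: dict mapping a page to the list of pages it links to.
    Returns (hub, auth) dicts, each normalized to sum to 1.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    # Every page starts with hub and authority scores of 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages it points to.
        new_hub = {
            p: sum(new_auth[q] for q in links.get(p, [])) for p in pages
        }
        # Normalize so the scores stay bounded across iterations.
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}
    return hub, auth

# Hypothetical graph: two hub-like pages linking to two authority-like pages.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1"],
    "auth1": [],
    "auth2": [],
}
hub, auth = hits(links)
print("hub scores:      ", hub)
print("authority scores:", auth)
```

In this toy graph `auth1` is pointed to by both hubs and so ends up the strongest authority, while `hub1` points to both authorities and ends up the strongest hub, matching the mutual reinforcement described above.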