CSCE 470 Lecture 6

Friday, September 6, 2013


Search Ranking

Two Key Questions

  1. How to represent each query or document?
    • set of terms
    • bag of words → TF, log TF, TF-IDF
  2. How to measure similarity (or distance) between $q$ and $d$?
    • Jaccard
    • Manhattan Distance ($|u_1 - v_1| + \dots + |u_n - v_n|$)
    • Euclidean Distance ($\sqrt{(u_1 - v_1)^2 + \dots + (u_n - v_n)^2}$)
    • Cosine
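
As a rough illustration of the first three measures, here is a minimal Python sketch. The function names, the whitespace tokenization, and the example query and document are assumptions made for this example, not part of the lecture. Jaccard works on the set-of-terms representation, while the two distances work on raw term-frequency (TF) vectors over a shared vocabulary.

```python
import math
from collections import Counter


def jaccard(a_terms, b_terms):
    """Jaccard similarity of two term sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_terms), set(b_terms)
    if not a and not b:
        return 0.0  # assumed convention for two empty inputs
    return len(a & b) / len(a | b)


def term_frequencies(tokens, vocabulary):
    """Bag-of-words vector: raw count of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def manhattan(u, v):
    """|u_1 - v_1| + ... + |u_n - v_n|"""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))


def euclidean(u, v):
    """sqrt((u_1 - v_1)^2 + ... + (u_n - v_n)^2)"""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


# Hypothetical query and document, tokenized on whitespace.
query = "cheap flights texas".split()
doc = "cheap cheap flights to texas".split()

# Shared vocabulary so both vectors have the same dimensions.
vocab = sorted(set(query) | set(doc))
q_vec = term_frequencies(query, vocab)  # [1, 1, 1, 0]
d_vec = term_frequencies(doc, vocab)    # [2, 1, 1, 1]

print(jaccard(query, doc))      # 0.75
print(manhattan(q_vec, d_vec))  # 2
print(euclidean(q_vec, d_vec))  # about 1.414
```

Note that Jaccard is a similarity (larger means closer), whereas Manhattan and Euclidean are distances (smaller means closer).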

Cosine Similarity

Measures the angle $\theta$ between the query vector $\vec{q}$ and the document vector $\vec{d}$. Two vectors are similar if

  • the angle between them is smaller (i.e. $\theta \to 0$), and thus
  • the cosine similarity is larger (i.e. $\cos{\theta} \to 1$)

From vector calculus, the cosine of the angle between two vectors is:

$$\mathrm{sim}\left( \vec{q}, \vec{d} \right) = \cos{\theta} = \frac{\vec{q} \cdot \vec{d}}{\left\| \vec{q} \right\| \, \left\| \vec{d} \right\|}$$
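
The cosine formula can be sketched directly in Python. This is a minimal version that assumes both vectors are plain lists of numbers of equal length; the function name and the choice to return 0.0 for a zero-length vector are assumptions of this example, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(q, d):
    """cos(theta) = (q . d) / (||q|| * ||d||)"""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention: no similarity with an empty vector
    return dot / (norm_q * norm_d)


# Same direction, different lengths: the angle is 0, so the cosine is close to 1.
print(cosine_similarity([1, 1, 0], [2, 2, 0]))

# No terms in common: the vectors are orthogonal, so the cosine is 0.
print(cosine_similarity([1, 0], [0, 1]))
```

Because the lengths are divided out, a document that repeats the query's terms twice as often scores the same as one that matches it exactly, which is not true of Manhattan or Euclidean distance.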

If the vectors are stored in normalized (unit-length) form, both norms in the denominator equal 1 and the similarity reduces to a dot product:

$$\hat{q} = \frac{\vec{q}}{\left\| \vec{q} \right\|} \qquad \hat{d} = \frac{\vec{d}}{\left\| \vec{d} \right\|} \qquad \mathrm{sim}\left( \hat{q}, \hat{d} \right) = \cos{\theta} = \hat{q} \cdot \hat{d}$$
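
One way to use this in practice is to normalize each document vector once, ahead of time, so that scoring a query needs only dot products. The sketch below assumes small hand-made TF vectors; the document names `d1` and `d2`, the helper names, and the dictionary layout are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical term-frequency vectors over a shared 4-term vocabulary.
docs = {"d1": [2, 1, 1, 1], "d2": [0, 0, 3, 0]}

# Done once, ahead of time: store unit-length document vectors.
unit_docs = {name: normalize(vec) for name, vec in docs.items()}

# Done per query: normalize the query, then score with plain dot products.
q_hat = normalize([1, 1, 1, 0])
scores = {name: dot(q_hat, d_hat) for name, d_hat in unit_docs.items()}

# Rank documents by descending cosine similarity.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2']
```

Since the query norm is the same for every document, it does not affect the ordering; normalizing the query only matters when the actual cosine value is needed.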