CSCE 470 Lecture 7


« previous | Monday, September 9, 2013 | next »


Ranking Results Quickly

Let N be 1 billion, the number of documents in the collection.

If we compute the cosine similarity for a query over all N documents, our running time is O(N) to compute the scores and O(N log N) to sort the results.

Instead, if we use a max heap over the N document scores, we can build the heap in O(N) time and then read off the top K elements in O(K log N) time, which beats the full sort when K is much smaller than N.
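A minimal sketch of this heap-based top-K selection in Python (the scores dictionary, document IDs, and value of k below are made-up illustrations, not from the lecture):

```python
import heapq

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores.

    Builds a heap over all N scores in O(N), then pops the top k in
    O(k log N) -- cheaper than a full O(N log N) sort when k << N.
    """
    # heapq is a min-heap, so negate scores to simulate a max heap.
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)                       # O(N) heap construction
    results = []
    for _ in range(min(k, len(heap))):        # O(k log N) to read off top k
        neg_score, doc_id = heapq.heappop(heap)
        results.append((doc_id, -neg_score))
    return results

# Hypothetical cosine scores for a handful of documents.
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.45, "d4": 0.91, "d5": 0.33}
print(top_k(scores, 2))   # [('d4', 0.91), ('d2', 0.87)]
```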

Tiered Index

Multiple indices:

  • Tier 1 has "good" documents
  • Tier 2 has "okay" documents
  • Tier 3 has unknown documents

Goal: if we can find the first K results from Tier 1 and/or Tier 2, we're good (no computation over the lower tiers).
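A rough sketch of this tier fall-through logic, assuming each tier is simply an iterable of candidate documents and score_fn is whatever scoring function is in use; both names (and the toy overlap scorer) are placeholders, not from the lecture:

```python
def tiered_search(query, tiers, k, score_fn):
    """Scan tiers best-first; stop descending once k results are in hand.

    tiers    -- list of document collections, ordered from "good" to unknown
    score_fn -- placeholder scoring function: score_fn(query, doc) -> number
    """
    results = []
    for tier in tiers:
        for doc in tier:
            results.append((score_fn(query, doc), doc))
        if len(results) >= k:
            break                      # lower tiers are never examined
    results.sort(reverse=True)         # rank only what we actually scored
    return results[:k]

# Toy usage: "documents" are strings, scored by query-term overlap.
tiers = [["good doc about cats"], ["okay doc about dogs"], ["unknown doc"]]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
print(tiered_search("cats and dogs", tiers, k=1, score_fn=overlap))
```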

There are still many components we haven't covered yet:

  • Positional Indexes
  • Spelling Correction
  • k-gram indexes for wildcard queries and spelling corrections
  • Document Caching
  • Word Proximities
  • etc.


Evaluation

Measurable metrics from a test user base:

  • Personal friends and family (tiny)
  • Lab test group? (small)
  • Blind tests from randomly sampled users/clients (A/B testing)? (large)
  • Opt-in for all users (really large)
  • Even larger audiences...

Metrics

In order of difficulty to measure:

  • How fast is the search? (easy)
  • Number of clicks or page views
  • Click behaviors
  • Do users buy stuff?

Bottom-line: User Happiness vs. Company Happiness

We're going to approach this with a simplified list of metrics:

  • Precision
  • Recall
  • F-Measure
  • NDCG

All of these metrics focus only on relevance.
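The lecture only names these metrics; as a reference, here is a small sketch using the standard textbook definitions (precision, recall, and balanced F-measure over unranked result sets, NDCG over a graded ranked list). The example judgments are invented:

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def ndcg(graded_relevance, k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal reordering.

    graded_relevance lists the relevance grade of each result in ranked
    order (e.g. judgments on a 0-3 scale).
    """
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    actual = dcg(graded_relevance[:k])
    ideal = dcg(sorted(graded_relevance, reverse=True)[:k])
    return actual / ideal if ideal else 0.0

# Hypothetical judgments for a ranked list of results.
print(precision_recall_f1(["d4", "d2", "d7"], ["d2", "d3", "d4"]))
print(ndcg([3, 2, 0, 1, 2], k=5))
```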