CSCE 470 Lecture 7
« previous | Monday, September 9, 2013 | next »
Ranking Results Quickly
Let N be 1 billion (the number of documents in the collection).
If we compute the cosine similarity for a query over all N documents, our running time is O(N) to compute the scores and O(N log N) to sort the results.
Instead, if we use a max heap with N elements, we can build the heap in O(N) (or perform individual insertions in O(log N) time), and then read off the top K elements of the heap in O(K log N) total, which is much cheaper than a full sort when K ≪ N.
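As a concrete illustration, here is a minimal Python sketch of the heap-based top-K selection. The function and variable names are hypothetical, not from the lecture:

```python
import heapq

def top_k_docs(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs.

    scores: dict mapping doc_id -> cosine similarity (assumed input format).
    heapify builds the heap over all N scores in O(N); each of the k
    winners is then read off in O(log N), for O(N + k log N) total.
    """
    # heapq is a min-heap, so negate the scores to simulate a max heap.
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)  # O(N) bottom-up heap construction
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)  # O(log N) per pop
        winners.append((doc_id, -neg_score))
    return winners

# Toy example: pick the top 2 of 3 scored documents.
print(top_k_docs({"d1": 0.9, "d2": 0.4, "d3": 0.7}, k=2))
# [('d1', 0.9), ('d3', 0.7)]
```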
Tiered Index
Multiple indices:
- Tier 1 has "good" documents
- Tier 2 has "okay" documents
- Tier 3 has unknown documents
Goal: if we can find the top K results from Tier 1 and/or Tier 2, we're good. (No computation over the lower tiers.)
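A minimal sketch of the tiered fallback logic, assuming a hypothetical score_fn that scores a query against a single tier's index (none of these names come from the lecture):

```python
def tiered_search(query, tiers, k, score_fn):
    """Probe tiers best-first; stop as soon as k candidates are collected.

    tiers: indices ordered by quality, e.g. [tier1, tier2, tier3].
    score_fn(query, index) -> list of (doc_id, score) pairs (assumed API).
    """
    results = []
    for tier in tiers:
        results.extend(score_fn(query, tier))
        if len(results) >= k:
            break  # enough hits from high tiers; lower tiers are never touched
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:k]

# Toy demo: each "tier" is just a dict of precomputed scores.
toy_score = lambda q, idx: list(idx.items())
tier1, tier2 = {"d1": 0.95}, {"d2": 0.6, "d3": 0.5}
print(tiered_search("ir", [tier1, tier2], k=2, score_fn=toy_score))
# [('d1', 0.95), ('d2', 0.6)]
```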
There are still many components we haven't covered yet:
- Positional Indexes
- Spelling Correction
- k-gram indexes for wildcard queries and spelling corrections
- Document Caching
- Word Proximities
- etc.
Evaluation
Measurable metrics from a test user base, at increasing scale:
- Personal friends and family (tiny)
- Lab test group? (small)
- Blind tests from randomly sampled users/clients (A/B testing)? (large)
- Opt-in for all users (really large)
- Even larger audiences...
Metrics
In increasing order of difficulty to measure:
- How fast is the search? (easy)
- Number of clicks or page views
- Click behaviors
- Do users buy stuff?
Bottom line: User Happiness vs. Company Happiness
We're going to approach this with a simplified list of metrics:
- Precision
- Recall
- F-Measure
- NDCG
All of these metrics focus only on relevance.
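For reference, here is a sketch of how these four metrics are typically computed: set-based Precision/Recall/F-Measure and a common NDCG formulation. The lecture doesn't define them yet, so the exact variants used here are assumptions:

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def ndcg(gains, k):
    """NDCG@k for a ranked list of graded relevance gains.

    Uses the common DCG form sum(rel_i / log2(i + 1)) over ranks i = 1..k;
    other DCG variants exist, and the choice here is an assumption.
    """
    def dcg(gs):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Toy examples with made-up relevance judgments.
print(precision_recall_f1(["d1", "d2", "d3"], ["d2", "d4"]))  # (0.333..., 0.5, 0.4)
print(ndcg([3, 2, 0, 1], k=4))  # about 0.985
```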