CSCE 470 Lecture 7


« previous | Monday, September 9, 2013 | next »


Ranking Results Quickly

Let $n$, the number of documents, be 1 billion.

If we compute the cosine similarity for a query over all billion documents, our running time is $O(n)$ to compute scores and $O(n \log n)$ to sort results.

Instead, if we use a max heap with $k$ elements, we can perform the $n$ insertions in $O(n \log k)$ time, and then read off the top $k$ elements of the heap.
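The heap-based top-$k$ selection above can be sketched as follows. (This sketch uses Python's `heapq`, which is a min-heap: the smallest of the current top $k$ sits at the root and is evicted when a better score arrives. The notes describe the same $O(n \log k)$ idea in terms of a max heap; the function and variable names here are illustrative, not from the lecture.)

```python
import heapq

def top_k(scores, k):
    """Select the k highest-scoring documents without a full sort.

    Maintains a heap of at most k (score, doc_id) pairs; each of the
    n insertions costs O(log k), for O(n log k) total rather than the
    O(n log n) cost of sorting all n scores.
    """
    heap = []  # min-heap: smallest of the current top-k at heap[0]
    for doc_id, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # New score beats the worst of the current top-k: swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    # Read the heap off in descending score order.
    return sorted(heap, reverse=True)

# Example: top 3 of 8 cosine scores
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5]
print(top_k(scores, 3))  # [(0.9, 1), (0.8, 5), (0.7, 3)]
```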

Tiered Index

Multiple indices:

  • Tier 1 has "good" documents
  • Tier 2 has "okay" documents
  • Tier 3 has unknown documents

Goal: if we can find the first $k$ results from Tier 1 and/or Tier 2, we're good. (No computation over the lower tiers is needed.)
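The tier-fallback idea can be sketched as below. (This assumes each tier's matches for the query have already been retrieved and ranked; the function name and the toy document ids are illustrative, not from the lecture.)

```python
def tiered_search(tiers, k):
    """Collect results tier by tier, best tier first, stopping early.

    `tiers` is a list of ranked result lists, one per tier. As soon as
    we have k results, we stop and never touch the lower tiers.
    """
    results = []
    for tier_results in tiers:
        results.extend(tier_results)
        if len(results) >= k:
            break  # enough results; skip all lower tiers
    return results[:k]

# Tier 1 ("good"), Tier 2 ("okay"), Tier 3 (unknown)
tiers = [["d3", "d7"], ["d1", "d9", "d4"], ["d2"]]
print(tiered_search(tiers, 3))  # ['d3', 'd7', 'd1'] -- Tier 3 untouched
```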

There are still many components we haven't covered yet:

  • Positional Indexes
  • Spelling Correction
  • k-gram indexes for wildcard queries and spelling correction
  • Document Caching
  • Word Proximities
  • etc.


Evaluation

Measurable metrics from a test user base:

  • Personal friends and family (tiny)
  • Lab test group? (small)
  • Blind tests from randomly sampled users/clients (A/B testing)? (large)
  • Opt-in for all users (really large)
  • Even larger audiences...

Metrics

In order of difficulty to measure:

  • How fast is the search? (easy)
  • Number of clicks or page views
  • Click behaviors
  • Do users buy stuff?

Bottom line: User Happiness vs. Company Happiness

We're going to approach this with a simplified list of metrics:

  • Precision
  • Recall
  • F-Measure
  • NDCG

All of these metrics focus only on relevance.
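The four metrics above can be sketched as follows. (This is a minimal illustration using the standard definitions: set-based precision/recall, the balanced F-measure, and NDCG with a $\log_2$ rank discount. The function names and the toy judgments are illustrative, not from the lecture.)

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved)   # fraction of retrieved docs that are relevant
    r = hits / len(relevant)    # fraction of relevant docs that were retrieved
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f1

def ndcg(gains, k):
    """Normalized Discounted Cumulative Gain at rank k.

    `gains` are graded relevance judgments in ranked order. DCG discounts
    each gain by log2 of its rank; NDCG divides by the DCG of the ideal
    (best-first) ordering, so a perfect ranking scores 1.0.
    """
    def dcg(g):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(g[:k]))
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal)

# Toy example: 4 retrieved docs, 3 truly relevant docs
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```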