CSCE 470 Lecture 2
« previous | Wednesday, August 28, 2013 | next »
Abstract IR Architecture
Query → Query Representation + Document → Document Representation → index = hits
Term-Document Incidence Matrix
Rows = terms
Columns = document IDs
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ij}} is 1 if term Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i} appears in document Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle j} , 0 otherwise
There will be very few ones and very many zeroes, so this will be a sparse matrix
Sorting Keys
- Linear search (Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O\left( n \right)} )
- Lexicographically ordered: binary search (Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O\left( \log{n} \right)} )
- Hash Map (Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O\left( 1 \right)} )
Boolean queries:
AND, OR, NOT
Represent rows of matrix as vector, perform corresponding bitwise operations on vector to get results
no ranking
Index
Throw away all the zeroes, only keep the ones
Dictionary maps term keys to a collection of documents.
Boolean query operations now become set operations (union, intersect, complement)
Search optimizations now come in to play