CSCE 470 Lecture 15
Friday, September 27, 2013
Large-Scale Computing
Divide and Conquer
Split "work" among several workers and merge their individual results
Word Count: count the number of times each distinct word appears in a document (or in all of Wikipedia)
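As a plain divide-and-conquer illustration (before any MapReduce machinery), here is a sketch using Python's multiprocessing module; the chunk-by-lines strategy and the count_words/word_count names are illustrative, not from the lecture:

from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # conquer: each worker counts the words in its own chunk
    return Counter(chunk.split())

def word_count(text, workers=4):
    # divide: split the text into roughly equal groups of lines
    lines = text.splitlines()
    size = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    # merge: sum the per-worker Counters into one result
    with Pool(workers) as pool:
        return sum(pool.map(count_words, chunks), Counter())

if __name__ == "__main__":
    print(word_count("the crew\nof the", workers=2))  # Counter({'the': 2, 'crew': 1, 'of': 1})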
Synchronization
What if workers need to share results?
Failures
What if the network or other hardware fails?
Tools
- Shared Memory (pthreads)
- Message Passing (MPI)
- Design Patterns
- Master-Slaves
- Producer-Consumer (see the sketch after this list)
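A minimal producer-consumer sketch using Python's built-in queue and threading modules; the squaring step is a placeholder for real work, and the None sentinel is one common shutdown idiom:

import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)        # hand each work item to whichever consumer is free
    q.put(None)            # sentinel: no more work is coming

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            q.put(None)    # pass the sentinel on so other consumers also stop
            break
        results.append(item * item)  # placeholder for real work

q, results = queue.Queue(), []
consumers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(2)]
for t in consumers:
    t.start()
producer(q, range(10))
for t in consumers:
    t.join()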
Summary
Concurrency is difficult to reason about, even more so at data center scale.
MapReduce
Google has a proprietary implementation in C++
- Bindings in Java, Python
Hadoop is an open-source implementation in Java
What does it do?
- Handles scheduling: assigns workers to map and reduce tasks
- Handles data distribution
- Handles synchronization
- Handles errors and faults
- Everything happens on a distributed filesystem
MapReduce in 41 Words:
- Goal: count books in library
- Map: you count shelf 1, I count up shelf 2, etc.
- (the more people we get, the faster this part goes)
- Reduce: We all get together and add up our individual counts
What do we do?
Define two primary functions:
Map(k,v) -> <k',v'>*
Reduce(k', <v'>*) -> <k', v>
For example: two documents
- d1: "the crew"
- d2: "of the"
Map(d1, "the crew") => [(the, 1), (crew, 1)]
Map(d2, "of the") => [(of, 1), (the, 1)]
Reduce(the, [1, 1]) => (the, 2)
Reduce(crew, [1]) => (crew, 1)
Reduce(of, [1]) => (of, 1)
Assuming this mapper/reducer pair is written for the mrjob framework (its method signatures match), it becomes runnable when wrapped in an MRJob subclass:

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, key, value):
        for word in value.split():
            yield word, 1   # emit (word, 1) for each word on the line

    def reducer(self, key, values):
        yield key, sum(values)   # total count for each word
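One step is implicit above: between the map and reduce phases, the framework shuffles the intermediate pairs, grouping every value by its key. A minimal in-memory sketch of the whole pipeline (the run_mapreduce helper is hypothetical, not part of Hadoop or mrjob):

from collections import defaultdict

def mapper(_, line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

def run_mapreduce(mapper, reducer, inputs):
    grouped = defaultdict(list)
    for key, value in inputs:            # map phase
        for k2, v2 in mapper(key, value):
            grouped[k2].append(v2)       # shuffle: group values by key
    out = []
    for k, vs in grouped.items():        # reduce phase
        out.extend(reducer(k, vs))
    return out

docs = [("d1", "the crew"), ("d2", "of the")]
print(run_mapreduce(mapper, reducer, docs))
# [('the', 2), ('crew', 1), ('of', 1)]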
Data Flow
Input and final output are stored on a distributed filesystem
Scheduler tries to map tasks "close" to physical storage location of input data
Intermediate results are stored on the local filesystems of the map/reduce workers; one task's output is often the input to another MapReduce task (see the chaining sketch below)
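mrjob (assuming that is the framework above) expresses such chains as multiple steps within one job. A sketch adapted from its documented steps()/MRStep API, finding the most frequent word in the input:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MostCommonWord(MRJob):
    def steps(self):
        # step 1 counts words; step 2 reduces those counts to a single maximum
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_count_words(self, word, counts):
        # emit under a single key so one reducer sees every (count, word) pair
        yield None, (sum(counts), word)

    def reducer_find_max(self, _, count_word_pairs):
        yield max(count_word_pairs)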