CSCE 470 Lecture 15
Friday, September 27, 2013
Large-Scale Computing
Divide and Conquer
Split "work" among several workers and merge their individual results
Word Count: count the number of times each distinct word appears in a document (or in all of Wikipedia)
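As a plain divide-and-conquer illustration (before any MapReduce machinery), here is a sketch using Python's multiprocessing module; the chunk-by-lines strategy and the count_words/word_count names are illustrative, not from the lecture:

from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # conquer: each worker counts the words in its own chunk
    return Counter(chunk.split())

def word_count(text, workers=4):
    # divide: split the text into roughly equal groups of lines
    lines = text.splitlines()
    size = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    # merge: sum the per-worker Counters into one result
    with Pool(workers) as pool:
        return sum(pool.map(count_words, chunks), Counter())

if __name__ == "__main__":
    print(word_count("the crew\nof the", workers=2))  # Counter({'the': 2, 'crew': 1, 'of': 1})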
Synchronization
What if workers need to share results?
Failures
What if the network or other hardware fails?
Tools
- Shared Memory (pthreads)
- Message Passing (MPI)
- Design Patterns
- Master-Slaves
- Producer-Consumer (see the sketch after this list)
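A minimal producer-consumer sketch using Python's built-in queue and threading modules; the squaring step is a placeholder for real work, and the None sentinel is one common shutdown idiom:

import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)        # hand each work item to whichever consumer is free
    q.put(None)            # sentinel: no more work is coming

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            q.put(None)    # pass the sentinel on so other consumers also stop
            break
        results.append(item * item)  # placeholder for real work

q, results = queue.Queue(), []
consumers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(2)]
for t in consumers:
    t.start()
producer(q, range(10))
for t in consumers:
    t.join()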
Summary
Concurrency is difficult to reason about, even more so at data center scale.
MapReduce
Google has a proprietary implementation in C++
- Bindings in Java, Python
Hadoop is an open-source implementation in Java
What does it do?
- Handles scheduling: assigns workers to map and reduce tasks
- Handles data distribution
- Handles synchronization
- Handles errors and faults
- Everything happens on a distributed filesystem
MapReduce in 41 Words:
- Goal: count books in library
- Map: you count shelf 1, I count up shelf 2, etc.
- (the more people we get, the faster this part goes)
- Reduce: We all get together and add up our individual counts
What do we do?
Define two primary functions:
Map(k,v) -> <k',v'>*
Reduce(k', <v'>*) -> <k', v>
For example: two documents
- d1: "the crew"
- d2: "of the"
Map(d1, "the crew") => [(the, 1), (crew, 1)]
Map(d2, "of the") => [(of, 1), (the, 1)]
Reduce(the, [1, 1]) => (the, 2)
Reduce(crew, [1]) => (crew, 1)
Reduce(of, [1]) => (of, 1)
Assuming this mapper/reducer pair is written for the mrjob framework (its method signatures match), it becomes runnable when wrapped in an MRJob subclass:

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, key, value):
        for word in value.split():
            yield word, 1   # emit (word, 1) for each word on the line

    def reducer(self, key, values):
        yield key, sum(values)   # total count for each word
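One step is implicit above: between the map and reduce phases, the framework shuffles the intermediate pairs, grouping every value by its key. A minimal in-memory sketch of the whole pipeline (the run_mapreduce helper is hypothetical, not part of Hadoop or mrjob):

from collections import defaultdict

def mapper(_, line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

def run_mapreduce(mapper, reducer, inputs):
    grouped = defaultdict(list)
    for key, value in inputs:            # map phase
        for k2, v2 in mapper(key, value):
            grouped[k2].append(v2)       # shuffle: group values by key
    out = []
    for k, vs in grouped.items():        # reduce phase
        out.extend(reducer(k, vs))
    return out

docs = [("d1", "the crew"), ("d2", "of the")]
print(run_mapreduce(mapper, reducer, docs))
# [('the', 2), ('crew', 1), ('of', 1)]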
Data Flow
Input and final output are stored on a distributed filesystem
Scheduler tries to map tasks "close" to physical storage location of input data
Intermediate results are stored on the local filesystems of the map/reduce workers; one task's output is often the input to another MapReduce task (see the chaining sketch below)
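mrjob (assuming that is the framework above) expresses such chains as multiple steps within one job. A sketch adapted from its documented steps()/MRStep API, finding the most frequent word in the input:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MostCommonWord(MRJob):
    def steps(self):
        # step 1 counts words; step 2 reduces those counts to a single maximum
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_count_words(self, word, counts):
        # emit under a single key so one reducer sees every (count, word) pair
        yield None, (sum(counts), word)

    def reducer_find_max(self, _, count_word_pairs):
        yield max(count_word_pairs)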