CSCE 470 Lecture 15


« previous | Friday, September 27, 2013 | next »


Large-Scale Computing

Divide and Conquer

Split "work" among several workers and merge their individual results

Word Count: Count the number of times each distinct word appears in a document (or in all of Wikipedia)
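The split-and-merge idea can be sketched in a few lines. This is a minimal divide-and-conquer word count; the function names and thread-pool choice are illustrative assumptions, not from the notes:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines):
    # each worker counts the words in its own slice of the document
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

def word_count(lines, workers=4):
    # divide: give every worker an interleaved slice of the lines
    chunks = [lines[i::workers] for i in range(workers)]
    # conquer: run the workers, then merge their partial counts
    with ThreadPoolExecutor(workers) as pool:
        parts = pool.map(count_chunk, chunks)
    total = Counter()
    for part in parts:
        total.update(part)
    return total
```

The merge step works because word counts are associative: partial Counters can be added in any order.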

Synchronization

What if workers need to share results?

Failures

What if the network or other hardware fails?

Tools

  • Shared Memory (pthreads)
  • Message Passing (MPI)
  • Design Patterns
    • Master-Slaves
    • Producer-Consumer
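As a sketch of the producer-consumer pattern with shared memory, here is a minimal version using a thread-safe queue; the sentinel-based shutdown and the doubling "work" are illustrative assumptions, not from the notes:

```python
import queue
import threading

DONE = object()  # sentinel telling the consumer to stop

def producer(q, items):
    for item in items:
        q.put(item)      # hand work items to the consumer
    q.put(DONE)

def consumer(q, results):
    while True:
        item = q.get()
        if item is DONE:
            break
        results.append(item * 2)  # "process" the item

q = queue.Queue()
results = []
t = threading.Thread(target=consumer, args=(q, results))
t.start()
producer(q, [1, 2, 3])
t.join()
# results == [2, 4, 6]
```

The queue is what makes the synchronization problem above tractable: the workers never touch each other's state directly, only the queue.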

Summary

Concurrency is difficult to reason about, and even more so at data-center scale.


MapReduce

Google has a proprietary implementation in C++

  • Bindings in Java, Python

Hadoop is an open-source implementation in Java

What does it do?

  • Handles Scheduling: assigns workers to map and reduce tasks
  • Handles Data distribution
  • Handles Synchronization
  • Handles errors and faults
  • Everything happens on a distributed filesystem

MapReduce in 41 Words:

  • Goal: count books in library
  • Map: you count shelf 1, I count up shelf 2, etc.
  • (the more people we get, the faster this part goes)
  • Reduce: We all get together and add up our individual counts

What do we do?

Define two primary functions:

  • Map(k,v) -> <k',v'>*
  • Reduce(k', <v'>*) -> <k', v>

For example: two documents

  • d1: "the crew"
  • d2: "of the"
Map(d1, "the crew") => [(the, 1), (crew, 1)]
Map(d2, "of the") => [(of, 1), (the, 1)]

Reduce(the, [1,1]) => (the, 2)
Reduce(crew, [1]) => (crew, 1)
Reduce(of, [1]) => (of, 1)
# mrjob-style job (class wrapper assumed; notes give only the two methods)
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, key, value):
        for word in value.split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
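The whole map → shuffle → reduce flow can be simulated without any framework. This plain-function sketch reproduces the two-document example above; the `map_reduce` driver is an illustrative assumption, not part of any real MapReduce API:

```python
from collections import defaultdict

def mapper(key, value):
    # emit (word, 1) for each word in the document
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # sum all counts for one word
    yield key, sum(values)

def map_reduce(inputs):
    # map phase: run the mapper over every (key, value) input pair
    intermediate = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in mapper(k, v):
            intermediate[k2].append(v2)  # shuffle: group values by key
    # reduce phase: one reducer call per distinct intermediate key
    out = []
    for k2, vs in intermediate.items():
        out.extend(reducer(k2, vs))
    return dict(out)

# map_reduce([("d1", "the crew"), ("d2", "of the")])
# -> {"the": 2, "crew": 1, "of": 1}
```

The grouping step in the middle is exactly the shuffle that a real MapReduce framework performs between the two phases.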


Data Flow

Input and final output are stored on distributed file system

Scheduler tries to map tasks "close" to physical storage location of input data

Intermediate results are stored on the local filesystem of the map/reduce workers; the output of one MapReduce task is often the input to another.