Datacenter-Network-Aware Online Load Balancing in MapReduce

Resource type
Thesis type
(Thesis) M.Sc.
Date created
2015-07-10
Authors/Contributors
Author: Le, Yanfang
Abstract
MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. Different from earlier heuristics in the very late reduce stage or after seeing all the data, we address the skew from the beginning of data input, and make no assumption about a priori knowledge of the data distribution nor require synchronized operations. We show that the optimal strategy is a constrained version of online minimum makespan and, in the MapReduce context where pairs with identical keys must be scheduled to the same machine, we propose an online algorithm with a provable 2-competitive ratio. We further suggest a sample-based enhancement, which, probabilistically, achieves a 3/2-competitive ratio with a bounded error. Examine the project again, we found that the datacenter network could be a bottleneck in the shuffle subphase of MapReduce. This could potentially lead to a poor overall performance even with a balanced workload and thus needed to be addressed. Earlier studies either assume the network inside a datacenter is of negligible delay and infinite capacity, or use a hop count as the network cost measurement. We consider the realistic bandwidth constraints in real world datacenter networks and propose an effective solution toward near optimal datacenter-network-aware load balancing.
Document
Identifier
etd9102
Copyright statement
Copyright is held by the author.
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Liu, Jiangchuan
Thesis advisor: Ergun, Funda
Member of collection
Attachment Size
etd9102_YLe.pdf 901.21 KB