MapReduce has emerged as a powerful tool for distributed, scalable processing of voluminous data. Unlike earlier heuristics that mitigate skew only in the very late reduce stage or after seeing all the data, we address skew from the beginning of data input, assuming no a priori knowledge of the data distribution and requiring no synchronized operations. We show that the optimal strategy is a constrained version of online minimum makespan scheduling, and, in the MapReduce context where pairs with identical keys must be scheduled to the same machine, we propose an online algorithm with a provable 2-competitive ratio. We further suggest a sample-based enhancement that, probabilistically, achieves a 3/2-competitive ratio with bounded error.

Examining the problem further, we found that the datacenter network can be a bottleneck in the shuffle subphase of MapReduce, potentially degrading overall performance even with a balanced workload; this therefore also needs to be addressed. Earlier studies either assume the network inside a datacenter has negligible delay and infinite capacity, or use hop count as the network cost measure. We instead consider the bandwidth constraints of real-world datacenter networks and propose an effective solution toward near-optimal datacenter-network-aware load balancing.
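The online scheduling setting described above can be viewed as greedy list scheduling under a key-affinity constraint: the first time a key appears it is bound to the currently least-loaded machine, and all later pairs with that key must follow it. A minimal sketch of this idea, assuming a stream of (key, weight) pairs; the function and variable names are illustrative, not the thesis's actual algorithm:

```python
from collections import defaultdict

def greedy_key_affinity_schedule(stream, num_machines):
    """Assign each (key, weight) pair to a machine online.

    Constraint (from the MapReduce setting): pairs with identical keys
    must go to the same machine. Strategy (a sketch of greedy list
    scheduling): a key seen for the first time is bound to the machine
    with the smallest current load.
    """
    loads = [0] * num_machines          # current load on each machine
    key_to_machine = {}                 # sticky key -> machine binding
    for key, weight in stream:
        if key not in key_to_machine:
            # Bind a new key to the least-loaded machine right now.
            key_to_machine[key] = min(range(num_machines),
                                      key=loads.__getitem__)
        loads[key_to_machine[key]] += weight
    return key_to_machine, loads
```

For example, the stream `[("a", 1), ("b", 1), ("a", 1), ("c", 2)]` on two machines binds `a` to machine 0, then `b` and `c` to machine 1, giving final loads `[2, 3]`.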
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.