Fan, Xiaoyi

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2015-11-17

Authors/Contributors

Author: Fan, Xiaoyi

Abstract

Recent years have witnessed the prevalence of MapReduce-based systems, e.g., Apache Hadoop, in large-scale distributed data processing. In this new generation of big data processing framework, \emph{data locality} that seeks to co-locate computation with data, can effectively improve MapReduce performance since fetching data from remote servers across multiple network switches is known to be costly. There have been a significant studies on {\it data locality} that seeks to co-locate computation with data, so as to reduce cross-server traffic in MapReduce. They generally assume that the input data have little dependency with each other, which however is not necessarily true for that of many real-world applications, and we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged with such data dependency. State-of-the-art data replication and task scheduling strategies achieve data locality through replicating popular files and spreading the replicas over multiple servers. We take the graph data, a widely adopted data format in many big data application scenarios, as a representative case study. It illustrates that while working well for independent files, conventional data locality strategies can store highly dependent files on different servers, incurring excessive remote data accesses and consequently prolonging the job completion time. In this thesis, we present DALM (Dependency-Aware Locality for MapReduce), a comprehensive and practical solution toward dependency-aware locality for processing the real-world input data that can be highly skewed and dependent. DALM accommodates data-dependency in a data-locality framework, organically synthesizing the key components from data reorganization, replication, placement. Beside algorithmic design within the framework, we have also closely examined the deployment challenges, particularly in public virtualized cloud environments. We extensively evaluate DALM through both simulations and real-world implementations in a typical virtualized environment, and compare it with state-of-the-art solutions, including the default Hadoop system, the partition-based Surfer, and the popularity-based Scarlett. The results show that DALM can significantly improve data locality for different inputs. For popular iterative graph processing applications on Hadoop, our prototype implementation of DALM significantly reduces the job finish time, as compared to the default Hadoop system, Scarlett, and Surfer.

Keywords

Identifier

etd9284

Copyright statement

Copyright is held by the author.

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor: Liu, Jiangchuan

Member of collection

Computing Science Theses

Download file	Size
etd9284_XFan.pdf	3.56 MB

Dependency-aware Data Locality for MapReduce

Keywords

Views & downloads - as of June 2023