Dependency-aware Data Locality for MapReduce

Resource type
Thesis type
(Thesis) M.Sc.
Date created
2015-11-17
Authors/Contributors
Author: Fan, Xiaoyi
Abstract
Recent years have witnessed the prevalence of MapReduce-based systems, e.g., Apache Hadoop, in large-scale distributed data processing. In this new generation of big data processing framework, \emph{data locality} that seeks to co-locate computation with data, can effectively improve MapReduce performance since fetching data from remote servers across multiple network switches is known to be costly. There have been a significant studies on {\it data locality} that seeks to co-locate computation with data, so as to reduce cross-server traffic in MapReduce. They generally assume that the input data have little dependency with each other, which however is not necessarily true for that of many real-world applications, and we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged with such data dependency. State-of-the-art data replication and task scheduling strategies achieve data locality through replicating popular files and spreading the replicas over multiple servers. We take the graph data, a widely adopted data format in many big data application scenarios, as a representative case study. It illustrates that while working well for independent files, conventional data locality strategies can store highly dependent files on different servers, incurring excessive remote data accesses and consequently prolonging the job completion time. In this thesis, we present DALM (Dependency-Aware Locality for MapReduce), a comprehensive and practical solution toward dependency-aware locality for processing the real-world input data that can be highly skewed and dependent. DALM accommodates data-dependency in a data-locality framework, organically synthesizing the key components from data reorganization, replication, placement. Beside algorithmic design within the framework, we have also closely examined the deployment challenges, particularly in public virtualized cloud environments. We extensively evaluate DALM through both simulations and real-world implementations in a typical virtualized environment, and compare it with state-of-the-art solutions, including the default Hadoop system, the partition-based Surfer, and the popularity-based Scarlett. The results show that DALM can significantly improve data locality for different inputs. For popular iterative graph processing applications on Hadoop, our prototype implementation of DALM significantly reduces the job finish time, as compared to the default Hadoop system, Scarlett, and Surfer.
Document
Identifier
etd9284
Copyright statement
Copyright is held by the author.
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Liu, Jiangchuan
Member of collection
Attachment Size
etd9284_XFan.pdf 3.56 MB