Wang, Xiaojian

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2016-06-20

Authors/Contributors

Author: Wang, Xiaojian

Abstract

In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Much string columnar data cannot be compressed by special-purpose algorithms such as run-length encoding or dictionary compression, and the typical choice for such data is an LZ77-based compression algorithm such as GZIP or Snappy. These algorithms treat data as a block of bytes and do not exploit the columnar nature of the data. In this thesis, we develop a compression algorithm that uses frequent string patterns mined directly from a sample of a string column. The patterns are used as the dictionary phrases for compression. We discuss some interesting properties of frequent patterns in the context of compression, and develop a pruning method to address the cache inefficiencies in indexing the patterns. Experiments show that our compression algorithm outperforms Snappy in compression ratio while retaining compression and decompression speed.
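
The general idea described in the abstract (mine frequent string patterns from a sample of a column, then use them as dictionary phrases) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the brute-force substring counting, the savings-based ranking, and the token-list output format are all assumptions made for illustration, and the pruning method mentioned in the abstract is not modelled.

```python
from collections import Counter


def mine_patterns(sample, min_len=3, max_len=8, min_support=2, max_patterns=255):
    """Hypothetical miner: keep substrings that occur in at least
    `min_support` of the sampled column values."""
    support = Counter()
    for value in sample:
        seen = set()  # count each substring at most once per value
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                seen.add(value[i:i + n])
        support.update(seen)
    frequent = [p for p, c in support.items() if c >= min_support]
    # Assumed ranking: rough estimate of characters saved by each phrase.
    frequent.sort(key=lambda p: (-(len(p) - 1) * support[p], p))
    return frequent[:max_patterns]


def compress(value, patterns):
    """Greedy longest-match encoding. Emits an int (dictionary index) for a
    matched phrase and a one-character str for an unmatched literal."""
    by_len = sorted(range(len(patterns)), key=lambda k: -len(patterns[k]))
    tokens = []
    i = 0
    while i < len(value):
        for k in by_len:
            if value.startswith(patterns[k], i):
                tokens.append(k)
                i += len(patterns[k])
                break
        else:
            # No dictionary phrase starts here: emit a literal character.
            tokens.append(value[i])
            i += 1
    return tokens


def decompress(tokens, patterns):
    """Invert compress(): expand indices to phrases, keep literals as-is."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)


if __name__ == "__main__":
    column_sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/c",
    ]
    phrases = mine_patterns(column_sample)
    encoded = compress("http://example.com/z", phrases)
    assert decompress(encoded, phrases) == "http://example.com/z"
```

Because the dictionary is built from a sample of the column rather than from a sliding window over raw bytes, values that were not in the sample can still be encoded; any part not covered by a phrase falls back to literal characters.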

Keywords

Identifier

etd9638

Copyright statement

Copyright is held by the author.

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor: Pei, Jian

Member of collection

Computing Science Theses

Download file	Size
etd9638_XWang.pdf	966.6 KB

Efficiently compressing string columnar data using frequent pattern mining
