Prabhu, Chintana

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2022-12-22

Authors/Contributors

Author: Prabhu, Chintana

Abstract

Data reduction is a critical part of big data analysis and pre-processing. Compression-based data reduction methods are well known in the big data environment. But the processing overheads of decompression introduce latency which lowers performance of analytics algorithms in real-time environments. This thesis investigates techniques to minimize the decompression overhead, allowing regex search on compressed data. We propose ZTF-Grep, which follows two-level scanning to find regex matches. First scan searches word-only sub-expressions and decompresses the data segments most likely containing full regex matches. In the second scan, decompressed segments are searched for target regex, reducing overheads of complete decompression. The compression algorithm ZTF-phrase-hash, developed on the foundation of the Parabix framework, is a dictionary-based algorithm which replaces repeated Unicode n-grams with code-words. Our approach of selective decompression of compressed files shows performance benefits for regex queries requiring decompression of upto 63% of the compressed files.

Extent

61 pages.

Keywords

Identifier

etd22309

Copyright statement

Copyright is held by the author(s).

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Supervisor or Senior Supervisor

Thesis advisor: Cameron, Robert

Language

English

Member of collection

Computing Science Theses

Download file	Size
etd22309.pdf	2.39 MB

Selective decompression for high-performance compressed pattern matching using Parabix algorithms

Keywords

Views & downloads - as of June 2023