Data reduction is a critical part of big-data analysis and pre-processing, and compression-based reduction methods are well established in big-data environments. However, the processing overhead of decompression introduces latency that lowers the performance of analytics algorithms in real-time settings. This thesis investigates techniques that minimize decompression overhead by allowing regex search on compressed data. We propose ZTF-Grep, which performs two-level scanning to find regex matches. The first scan searches for word-only sub-expressions and decompresses only the data segments most likely to contain full regex matches; in the second scan, the decompressed segments are searched for the target regex, reducing the overhead of complete decompression. The underlying compression algorithm, ZTF-phrase-hash, developed on the foundation of the Parabix framework, is a dictionary-based algorithm that replaces repeated Unicode n-grams with code-words. Our approach of selective decompression shows performance benefits for regex queries that require decompressing up to 63% of a compressed file.
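The two-level scanning idea can be sketched as follows. This is an illustrative toy, not the thesis implementation: it uses `zlib` in place of ZTF-phrase-hash, and a hypothetical per-segment word index stands in for the first-scan mechanism over compressed code-words.

```python
# Toy sketch of two-level scanning with selective decompression.
# Assumptions (not from the thesis): zlib-compressed segments, and a
# word set recorded per segment at compression time as the "index".
import re
import zlib

def compress_segments(texts):
    """Compress each segment and record its set of words (toy index)."""
    return [(zlib.compress(t.encode()), set(re.findall(r"\w+", t)))
            for t in texts]

def two_level_grep(segments, pattern, literal_words):
    """Scan 1: keep only segments whose index contains every literal word
    of the regex. Scan 2: decompress those segments and run the regex."""
    rx = re.compile(pattern)
    hits = []
    for blob, words in segments:
        if all(w in words for w in literal_words):   # scan 1: cheap filter
            text = zlib.decompress(blob).decode()    # selective decompression
            hits.extend(rx.findall(text))            # scan 2: full regex
    return hits

segs = compress_segments(["error 42 in module", "all good here", "error 7"])
print(two_level_grep(segs, r"error \d+", ["error"]))  # → ['error 42', 'error 7']
```

Only the two segments containing the literal word "error" are decompressed; the middle segment is skipped entirely, which is the source of the overhead reduction.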
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Cameron, Robert