Accelerating compressed file pattern matching with Parabix techniques

Resource type
Thesis type
(Thesis) M.Sc.
Date created
Author: Wu, Xiangyu
Data compression is widely used in big data environments. At the same time, efficient information retrieval from compressed data is also a standard requirement. In this thesis, we propose a new method of compressed file pattern matching inspired by the bitstream pattern matching approach of ICGrep, a high-performance regular expression matching tool based on Parabix techniques. Instead of the traditional approach of fully decompressing the file before pattern matching, our approach performs many of the complex steps in the smaller compressed space. We selected LZ4 as a sample compression format and implemented a compressed file pattern matching tool, LZ4 Grep, which showed a small performance improvement. Moreover, we propose a new LZ4 compression algorithm for UTF-8 text files, which substantially improved the speed of compressed file pattern matching for Unicode regular expressions, especially those that use predefined Unicode categories.
Copyright statement
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Cameron, Robert
Member of collection
Download file Size
etd19907.pdf 2.14 MB
