Data compression is widely used in big data environments, and efficient information retrieval over the compressed data is an equally common requirement. In this thesis, we propose a new method for pattern matching on compressed files, inspired by the bitstream pattern matching approach of ICGrep, a high-performance regular expression matching tool based on Parabix techniques. Instead of the traditional approach of fully decompressing the file before pattern matching, our approach performs many of the complex processing steps in the small compressed space. We selected LZ4 as a sample compression format and implemented a compressed-file pattern matching tool, LZ4 Grep, which showed a small performance improvement. Moreover, we propose a new LZ4 compression algorithm for UTF-8 text files, which substantially improves the speed of compressed-file pattern matching for Unicode regular expressions, especially those that use predefined Unicode categories.
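For orientation, the traditional baseline that the abstract contrasts against (decompress fully, then match) can be sketched as follows. This is a minimal illustration, not the method or code of LZ4 Grep: it assumes a single raw LZ4 block without frame headers, and the names `lz4_block_decompress`, `grep_after_full_decompression`, and `SAMPLE_BLOCK` are hypothetical, introduced only for this example.

```python
import re


def lz4_block_decompress(block: bytes) -> bytes:
    """Decode one raw LZ4 block (no frame header) into its original bytes."""
    out = bytearray()
    i = 0
    n = len(block)
    while i < n:
        token = block[i]
        i += 1
        # High nibble: literal length; 15 means extra length bytes follow.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = block[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break
        out += block[i:i + lit_len]
        i += lit_len
        if i >= n:
            break  # the last sequence carries literals only
        # Two-byte little-endian offset back into the output produced so far.
        offset = block[i] | (block[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")
        # Low nibble: match length minus 4 (the minimum match is 4 bytes).
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = block[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4
        # Copy byte by byte because a match may overlap the bytes it produces.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)


def grep_after_full_decompression(block: bytes, pattern: bytes) -> list:
    """Baseline: decompress everything, then run the regex on the plain text."""
    text = lz4_block_decompress(block)
    return [m.start() for m in re.finditer(pattern, text)]


# Hand-built block: literals "abc", a 9-byte match at offset 3, then literal "!".
SAMPLE_BLOCK = bytes([0x35]) + b"abc" + bytes([0x03, 0x00]) + bytes([0x10]) + b"!"
```

In this baseline every byte of the decompressed text is materialized before matching starts, which is the cost the compressed-space approach described above aims to reduce.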
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Cameron, Robert