Resource type
Thesis type
(Thesis) M.Sc.
Date created
2018-10-05
Authors/Contributors
Author: Wu, Xiangyu
Abstract
Data compression is widely used in big data environments, where efficient information retrieval from the compressed data is also a standard requirement. In this thesis, we proposed a new method of compressed file pattern matching inspired by the bitstream pattern matching approach of ICGrep, a high-performance regular expression matching tool based on Parabix techniques. Instead of the traditional approach, which fully decompresses the compressed file before pattern matching, our approach handles many complex procedures in the small compressed space. We selected LZ4 as a sample compression format and implemented a compressed file pattern matching tool, LZ4 Grep, which showed a small performance improvement. Moreover, we proposed a new LZ4 compression algorithm for UTF-8 text files, which substantially improved the speed of compressed file pattern matching for Unicode regular expressions, especially those with predefined Unicode categories.
Identifier
etd19907
Copyright statement
Copyright is held by the author.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor: Cameron, Robert
Member of collection
| Download file | Size |
|---|---|
| etd19907.pdf | 2.14 MB |