Skip to main content

Leveraging static dataflow in dynamically-compiled pipeline-parallel text-parsing programs using the Parabix framework

Resource type
Thesis type
(Thesis) Ph.D.
Date created
Modern Big Data systems have shifted away from pure ETL solutions and towards real-time analysis of "raw text". Unsurprisingly, efficient analysis of raw text is a critical performance concern for Big Data systems. Parabix is an intrinsically parallel programming paradigm that is well suited for high-speed text matching, validation and transcoding problems. This framework combines the inherit SIMD bit-parallelism of the Pablo programming language, the thread-parallelism of linear-pipeline programming, and the dynamic-compilation capabilities of LLVM MCJIT to create ad-hoc text-query applications. In this dissertation, we extend the existing Parabix framework with a more generalized notion of processing rates, I/O attributes and automate the memory management of the program. Our goal is to support a wider range of programs than prior frameworks allowed and to reduce programmer effort when constructing highly-parallel programs. To do so, we incorporate dataflow-programming techniques, which allow us to better analyze pipeline stage relationships and automatically compile efficient linear-pipelines for variable-rate systems. We extend upon existing literature to provide a scheduling algorithm tailored for such systems. Our goal is reliably provide a near-linear multi-threaded speedup on commodity hardware for up to ⌊total single−threaded time/max(individual pipeline stage time)⌋ threads. Compared to the prior version of the Parabix framework, our work increases the average program acceleration by 8.2 – 63.9%, with the best improvements occurring on newer architectures with a greater number of threads. Our evaluation shows that by combining our fixed-data linear-pipeline model, data-parallel execution of state-free stages, and dataflow programming concepts, we can exceed our expected ideal speed-up limit for two of the ten test cases by nearly 40% and 60%, achieving a 3.5× acceleration with four threads. For six of the remaining eight cases, the framework attains at least 83% of the limit, with half being over 90%. We assess each program in detail — including the final problematic test case that only reaches 71% of our goal — and provide recommendations for future work.
310 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Cameron, Robert
Thesis advisor: Sumner, Nick
Member of collection
Download file Size
etd22017.pdf 5.56 MB

Views & downloads - as of June 2023

Views: 12
Downloads: 3