
Programming and automation support for scalable task-parallel HLS programs on modern multi-die FPGAs

Resource type
Thesis type
(Thesis) M.A.Sc.
Date created
Abstract
In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, a notable challenge arises when scaling up designs for modern datacenter FPGAs, which comprise multiple dies. The extra delays introduced by die crossings and routing congestion can significantly degrade the achievable frequency of designs on these FPGA boards. Because of the gap between HLS design and physical design, it is challenging for HLS programmers to identify the root causes and fix their HLS designs to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained partitioning, placement, and pipelining strategies on task-parallel HLS designs, where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks. In this work, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design, which supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an intuitive programming model for utilizing the buffer channel; on the backend, we implement efficient placement and pipelining strategies for the buffer channel.
To validate the effectiveness of our framework, we evaluate it on widely used Rodinia HLS benchmarks and two real-world designs. Compared to Vitis HLS baselines on AMD/Xilinx Alveo U280 and U50 FPGA boards, it achieves an average frequency improvement of 19%, with peak improvements of up to 89%.
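To illustrate the ping-pong buffering idea mentioned in the abstract, the following is a minimal software sketch in C++. It is not the PASTA buffer channel or its programming interface: the names `PingPongChannel`, `try_produce`, `try_consume`, `kDepth`, and `run_demo` are hypothetical, and the model is a plain sequential simulation, not synthesizable HLS code. It only shows the general pattern: two memory banks alternate between a producer task and a consumer task, each side stalls when no bank is available to it, and the consumer may read a bank in any order, which a FIFO stream would not allow.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of a ping-pong (double-buffered) channel.
// All names here are illustrative assumptions, not taken from the thesis.
constexpr int kDepth = 4;  // words per bank (assumed)

struct PingPongChannel {
    std::array<std::array<int, kDepth>, 2> bank{};  // two memory banks
    std::array<bool, 2> full{{false, false}};       // per-bank "filled" flag
    int wr = 0;  // bank the producer will acquire next
    int rd = 0;  // bank the consumer will acquire next

    // Producer side: fill a free bank, or return false (back-pressure).
    bool try_produce(int base) {
        if (full[wr]) return false;
        for (int i = 0; i < kDepth; ++i) bank[wr][i] = base + i;
        full[wr] = true;  // hand the bank over to the consumer
        wr ^= 1;          // switch to the other bank
        return true;
    }

    // Consumer side: drain a filled bank, or return false if none is ready.
    bool try_consume(std::vector<int>& out) {
        if (!full[rd]) return false;
        // Random access within the bank: read it in reverse order.
        for (int i = kDepth - 1; i >= 0; --i) out.push_back(bank[rd][i]);
        full[rd] = false;  // release the bank back to the producer
        rd ^= 1;
        return true;
    }
};

// Alternate producer and consumer steps until three blocks have moved through.
std::vector<int> run_demo() {
    PingPongChannel ch;
    std::vector<int> out;
    const int blocks = 3;
    int next = 0, produced = 0, consumed = 0;
    while (consumed < blocks) {
        if (produced < blocks && ch.try_produce(next)) {
            next += kDepth;
            ++produced;
        }
        if (ch.try_consume(out)) ++consumed;
    }
    assert(out.size() == static_cast<std::size_t>(blocks * kDepth));
    return out;
}
```

In hardware the two sides would run concurrently, so the producer could fill one bank while the consumer drains the other; the sequential loop above only models the hand-off protocol, not the overlap.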
76 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Fang, Zhenman
Member of collection
Download file Size
etd22780.pdf 17.37 MB