Khatti, Moazin

Resource type

Thesis

Thesis type

(Thesis) M.A.Sc.

Date created

2023-11-17

Authors/Contributors

Author: Khatti, Moazin

Abstract

In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, a notable challenge arises when scaling up designs for modern datacenter FPGAs, which comprise multiple dies. The extra delays introduced by die crossings and routing congestion can significantly degrade the frequency of designs on these FPGA boards. Because of the gap between HLS design and physical design, it is challenging for HLS programmers to analyze and identify the root causes and to fix their HLS designs to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained partitioning, placement, and pipelining strategies on task-parallel HLS designs, where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks. In this work, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design, which supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an intuitive programming model for utilizing the buffer channel; on the backend, we implement efficient placement and pipelining strategies for the buffer channel.
To validate the effectiveness of our framework, we test it on widely used Rodinia HLS benchmarks and two real-world designs and show an average frequency improvement of 19%, with peak improvements of up to 89% on AMD/Xilinx Alveo U280 and U50 FPGA boards compared to Vitis HLS baselines.

Extent

76 pages.

Identifier

etd22780

Copyright statement

Copyright is held by the author(s).

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Supervisor or Senior Supervisor

Thesis advisor: Fang, Zhenman

Language

English

Member of collection

Engineering Science Theses

Download file

etd22780.pdf (17.37 MB)

Title

Programming and automation support for scalable task-parallel HLS programs on modern multi-die FPGAs
