Since the invention of the microprocessor in 1971, its computational capacity has scaled more than 1000x through Moore's law and Dennard scaling. Dennard scaling ended with a rapid increase in leakage power roughly 30 years after it was proposed. This ushered in the era of multiprocessing, in which the additional transistors afforded by Moore's law were put to use as additional cores. The breakdown of Moore's law marks the start of a new era for computer architects. With generational scaling of computational capacity no longer guaranteed, application-specific hardware specialization is an attractive alternative for sustaining scaling trends. Hardware specialization broadly refers to identifying recurrent patterns in software, both static and dynamic, and optimizing them with dedicated integrated circuitry.

This dissertation describes a two-pronged approach to architectural specialization. First, a top-down approach uses program analysis to determine code regions amenable to specialization. We have implemented a prototype compiler tool-chain to methodically identify, analyze, extract, and grow code segments that are amenable to specialization. Second, a bottom-up approach evaluates hardware enhancements that enable efficient data movement for specialized regions. We have devised and evaluated coherence protocols and flexible caching mechanisms to reduce the overhead of data movement within specialized regions.

The former, workload-centric approach analyzes programs at the path granularity. We enumerate static and dynamic program characteristics accurately and with low overhead. Our observations show that analyzing amenability to specialization at the path granularity yields different conclusions than prior work: analyses at coarser granularities tend to smear program characteristics critical to specialization. We analyze the potential for performance and energy improvement via specialization at the path granularity.
We develop mechanisms to extract and merge amenable paths into segments called Braids. Braids are built on the observation that oft-executed program paths share the same start and end points. This increases offload opportunity while retaining the same interface as path-granularity specialization. To address the challenges of data movement, the latter, microarchitecture-first approach proposes a specialized coherence protocol tailored for accelerators and an adaptive-granularity caching mechanism. The hybrid coherence protocol localizes data movement to a specialized accelerator-only tile, reducing energy consumption and improving performance. Modern workloads have varied program characteristics, and fixed-granularity caching often introduces waste in the cache hierarchy: cache blocks are frequently evicted before all words in the fetched line have been touched by the processor. We propose a variable-granularity caching mechanism that reduces energy consumption while improving performance through better utilization of the available storage space.
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Thesis advisor: Arrvindh Shriraman