Instruction Reuse in SPEC, media and packet processing benchmarks:
A comparative study of power, performance and related microarchitectural
optimizations
Issue title: Embedded Processors and Systems: Architectural
Issues and Solutions for Emerging Applications
Affiliations: CAD Lab, SERC, Indian Institute of Science, Bangalore
560012, India. E-mail: surendra@cadl.iisc.ernet.in;
subhasis@cadl.iisc.ernet.in; nandy@serc.iisc.ernet.in
Note: Corresponding author: G. Surendra, E-mail:
surendra@cadl.iisc.ernet.in
Abstract: The effectiveness of Instruction Reuse (IR) – a technique to
eliminate redundant computations at run time – is limited by the fact
that performance gain seldom exceeds 3% and is dependent on the criticality of
instructions being "reused". In this paper, we focus on the power aspect of IR
and propose a "resultbus optimization" that exploits communication reuse to
reduce the power dissipated over a high capacitance resultbus. The
effectiveness of this optimization depends on the number of result producing
instructions that are reused and improves overall power and Energy-Delay
Product (EDP) by 3% over a base IR policy for a 1024-entry "Reuse Buffer" (RB).
As a domain-specific study, we examine the impact of multithreading
on IR in the context of packet header processing applications. Specifically,
sharing the RB among threads can lead to either constructive or destructive
interference, thereby increasing or decreasing the amount of IR that can be
uncovered. Further, packet header processing applications are unique in the
sense that repetition of data values within "flows" is quite prevalent, which
can be exploited to improve IR. We find that an architecture that uses this
"flow" information to govern accesses to the RB improves IR by as much as 4.6%
for header processing kernels.
Keywords: Instruction reuse, computation reuse, value locality, flow aggregation, low power, resultbus power
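
Illustrative sketch: the abstract describes IR as skipping a computation whose result is already held in a Reuse Buffer. The toy Python model below shows that idea only. It is not the paper's design: the class name ReuseBuffer, the (pc, operand values) lookup key, the LRU replacement policy, and the tiny example trace are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReuseBuffer:
    """Toy reuse buffer mapping (pc, operand values) to an earlier result."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = OrderedDict()  # insertion order doubles as LRU order (assumed policy)
        self.hits = 0
        self.lookups = 0

    def execute(self, pc, operands, compute):
        """Return the result for this instruction, reusing a stored one if possible."""
        key = (pc, operands)
        self.lookups += 1
        if key in self.table:
            # Same instruction, same operand values: reuse instead of recomputing.
            self.table.move_to_end(key)
            self.hits += 1
            return self.table[key]
        result = compute(*operands)
        self.table[key] = result
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry
        return result

    def reuse_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


# Hypothetical trace of (pc, operand values); every instruction here is an add.
rb = ReuseBuffer(entries=4)
trace = [(0x40, (3, 5)), (0x44, (8, 2)), (0x40, (3, 5)), (0x40, (3, 6))]
results = [rb.execute(pc, ops, lambda a, b: a + b) for pc, ops in trace]
```

In this trace only the third instruction repeats both the program counter and the operand values of an earlier one, so it is the only lookup served from the buffer. A flow-aware or multithreaded variant would change how the key is formed or how the table is shared, which this sketch does not attempt to model.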