Affiliations: ARM Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ,
UK | Hewlett-Packard Labs, 1501 Page Mill Rd, Palo Alto CA
94304, USA | INRIA Futurs and LRI, University of Paris-Sud,
France
Abstract: Indirect memory accesses, where a load is fed by another load, are
ubiquitous because of rich data structures and sophisticated software
conventions, such as the use of linkage tables and position-independent code.
Unfortunately, they can be costly: if both loads miss, two round trips to
memory are required even though the role of the first load is often limited to
fetching the address of the second load. To reduce the total latency of such
indirect accesses, a new instruction called load squared is introduced. A load
squared does two fetches, the first fetch reading the target address of the
second. (An offset is optionally added to the result of the first fetch.) The
load squared operation is performed by memory-side logic (typically, the memory
controller if it is not located on the main processor chip). In this study, load
squared is not an architecturally visible instruction: the micro-architecture
transparently decides which loads should be replaced by loads squared. We show
that performance is sometimes improved significantly, and never degraded.
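
The semantics described above can be illustrated with a minimal sketch. It is a toy model, not the mechanism evaluated in this study: the function names, the dictionary-based memory, and the two latency constants are all assumptions chosen for illustration. It contrasts a conventional pair of dependent loads, both assumed to miss, with a load squared whose two fetches are performed by memory-side logic.

```python
# Toy model only: names and latency values are illustrative assumptions.
ROUND_TRIP = 100    # hypothetical cycles for one processor <-> memory round trip
LOCAL_ACCESS = 20   # hypothetical cycles for an extra fetch done by memory-side logic

# Address 0x1000 holds a pointer (0x2000); the datum lives at 0x2000 + 8.
memory = {0x1000: 0x2000, 0x2008: 42}


def indirect_load(mem, addr, offset=0):
    """Conventional sequence: two dependent loads, both assumed to miss."""
    pointer = mem[addr]             # first round trip fetches the address
    value = mem[pointer + offset]   # second round trip fetches the data
    return value, 2 * ROUND_TRIP


def load_squared(mem, addr, offset=0):
    """Both fetches happen memory-side; only one round trip reaches the processor."""
    pointer = mem[addr]             # first fetch, performed by memory-side logic
    value = mem[pointer + offset]   # second fetch, optional offset added
    return value, ROUND_TRIP + LOCAL_ACCESS


value_a, cycles_a = indirect_load(memory, 0x1000, 8)
value_b, cycles_b = load_squared(memory, 0x1000, 8)
```

Both functions return the same value; under these assumed latencies the load squared saves the difference between a full round trip and a memory-side access. The actual gain depends on the memory system and on how often both loads miss.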