Abstract:
Memory can be located close to a CPU, at remote sockets, or on devices connected via interconnects such as CXL or NVLink. A larger distance between memory and the core accessing it usually results in higher access latency. Software prefetching algorithms claim to hide memory access latencies by moving data into the CPU cache before a core accesses it. In this work, we analyze to what extent software prefetching can hide increased memory access latencies. We evaluate prefetching on seven systems, each offering different memory technologies and access latencies. We show that prefetching can increase performance by up to 2.6× and 2.8× for B+-Tree and binary search workloads, respectively. We find that CPU fill buffers, which track L1 cache misses, and a workload’s memory intensity dictate how much access latency can be hidden. CPUs implement prefetches differently. We introduce microbenchmarks that identify the concrete target cache and eviction strategies for different prefetch localities across x86 and ARM architectures. When the fill buffers are full, CPUs either drop prefetches or stall until all of them can be executed. We refer to these behaviors as weak and strong prefetching reliability, and we introduce microbenchmarks that identify a CPU’s reliability. When prefetching 8 KiB B+-Tree nodes, weak reliability achieves a 2× speedup, while strong reliability degrades lookup performance with a 2.5× slowdown.
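To illustrate the mechanism the abstract refers to, the sketch below shows how a software prefetch is typically issued in a binary search loop. It is a minimal example using the GCC/Clang `__builtin_prefetch` built-in, not the paper’s benchmark code; the function name and prefetch placement are our own illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: before the branch on a[mid] resolves, prefetch the
 * midpoints of both possible next halves so the following iteration's
 * load may already be in cache. __builtin_prefetch(addr, rw, locality)
 * is a GCC/Clang built-in; locality 3 requests the highest temporal
 * locality (e.g., PREFETCHT0 on x86). Prefetches are hints and cannot
 * fault, so a one-past-the-end address is harmless. */
static int64_t binary_search_prefetch(const int64_t *a, size_t n, int64_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Candidate midpoint if we descend into the left half. */
        __builtin_prefetch(&a[lo + (mid - lo) / 2], 0, 3);
        /* Candidate midpoint if we descend into the right half. */
        __builtin_prefetch(&a[mid + 1 + (hi - mid - 1) / 2], 0, 3);
        if (a[mid] < key)
            lo = mid + 1;
        else if (a[mid] > key)
            hi = mid;
        else
            return (int64_t)mid;
    }
    return -1; /* key not found */
}
```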