Performance Impacts and Limitations of Hardware Memory Access Trace Collection

Nicholas C. Doyle¹, Eric Matthews¹, Graham Holland¹, Alexandra Fedorova² and Lesley Shannon¹
¹ School of Engineering Science, Simon Fraser University, Burnaby, Canada.
² Dept. of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada.


In today's multicore architectures, complex interactions between applications in the memory system can have a significant and highly variable impact on application execution time. System designers typically use hardware counters to profile execution behaviours and diagnose performance problems. However, hardware counters are not always sufficient and some problems are best identified with full memory access traces. Collecting these traces in software is very expensive; our work explores using dedicated hardware for memory-access trace collection. We analyze the limitations of this approach and its impacts on application performance. Our study is performed on actual hardware using two very different CPU platforms: 1) the PolyBlaze multicore soft processor and 2) the ARM Cortex-A9. In both cases, the data collection is implemented on an FPGA. Using micro-benchmarks designed to test the bounds of memory access behaviour, we illustrate the operational regions of data collection and the impact on system performance. By examining the bandwidth bottlenecks that limit the rate of data collection, as well as hardware architecture choices that can aggravate the impact on application performance, we provide guidelines that can be used to extrapolate our analysis to other systems and processor architectures.
