MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

Matheus Cavalcante1,a, Samuel Riedel1,b, Antonio Pullini2 and Luca Benini3
1ETH Zürich, Zürich, Switzerland
amatheusd at iis.ee.ethz.ch
bsriedel at iis.ee.ethz.ch
2GreenWaves Technologies, Grenoble, France
bpullinia at iis.ee.ethz.ch
3ETH Zürich, Zürich, Switzerland; Università di Bologna, Bologna, Italy
lbenini at iis.ee.ethz.ch

ABSTRACT


A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is ensuring low-latency and efficient access to the L1 memory. In this work, we demonstrate that it is possible to scale up the shared-L1 architecture: we present MemPool, a 32-bit many-core system with 256 fast RV32IMA “Snitch” cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25 °C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool’s physically aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency below 6 cycles, even for a heavy injected load of 0.33 requests/core/cycle. We also propose a lightweight addressing scheme that maps each core’s private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20% in real-world signal processing benchmarks. The addressing scheme is also highly energy-efficient, since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline.

Keywords: Many-core, MIMD, Networks-on-Chip.


