Hardware-Accelerated Energy-Efficient Synchronization and Communication for Ultra-Low-Power Tightly Coupled Clusters

Florian Glaser^1,a, Germain Haugou^1,b, Davide Rossi^2, Qiuting Huang^1,c and Luca Benini^1,2,d
^1 Integrated Systems Laboratory, ETH Zürich, Switzerland
^a glaser@iis.ee.ethz.ch
^b haugoug@iis.ee.ethz.ch
^c huang@iis.ee.ethz.ch
^d benini@iis.ee.ethz.ch
^2 Electrical, Electronic, and Information Engineering, University of Bologna, Italy
davide.rossi@unibo.it

ABSTRACT


Parallel ultra-low-power computing is emerging as an enabler to meet the growing performance and energy-efficiency demands of deeply embedded systems such as the end-nodes of the Internet of Things (IoT). The parallel nature of these systems, however, adds a significant degree of complexity, as processing elements (PEs) need to communicate in various ways to organize and synchronize execution. Naive implementations of these central and non-trivial mechanisms can quickly jeopardize overall system performance and limit the achievable speedup and energy efficiency. To avoid this bottleneck, we present an event-based solution centered around a technology-independent, lightweight and scalable (up to 16 cores) synchronization and communication unit (SCU) and its integration into a shared-memory multicore cluster. Careful design and tight coupling of the SCU to the data interfaces of the cores allow common synchronization procedures to be executed with a single instruction. Furthermore, we present hardware support for the common barrier and lock synchronization primitives with a barrier latency of only eleven cycles, independent of the number of involved cores. We demonstrate the efficiency of the solution based on experiments with a post-layout implementation of the multicore cluster in a 22 nm CMOS process, where the SCU constitutes less than 2 % of area overhead. Our solution supports parallel sections as small as 100 or 72 cycles with a synchronization overhead of just 10 %, an improvement of up to 14× or 30× with respect to cycle count or energy, respectively, compared to a test-and-set based implementation.
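For orientation, the kind of software baseline the abstract compares against can be sketched as a test-and-set spinlock with a centralized, sense-reversing barrier built on top of it. This is a minimal illustration in portable C11 atomics, not the authors' implementation; all type and function names (`tas_lock_t`, `sw_barrier_t`, `sw_barrier_wait`, and so on) are hypothetical, and the barrier algorithm is an assumption chosen because it is a common textbook construction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set spinlock (hypothetical names): one shared flag per lock. */
typedef struct {
    atomic_flag taken;
} tas_lock_t;

static void tas_lock_init(tas_lock_t *l) {
    atomic_flag_clear(&l->taken);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool tas_try_lock(tas_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->taken, memory_order_acquire);
}

static void tas_lock(tas_lock_t *l) {
    while (!tas_try_lock(l)) {
        /* Busy-wait: every retry is another access to the shared flag. */
    }
}

static void tas_unlock(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->taken, memory_order_release);
}

/* Centralized sense-reversing barrier whose counter is guarded by the lock. */
typedef struct {
    tas_lock_t lock;
    unsigned num_cores;  /* number of participating cores */
    unsigned arrived;    /* cores that reached the barrier so far */
    atomic_bool sense;   /* flipped by the last core to arrive */
} sw_barrier_t;

static void sw_barrier_init(sw_barrier_t *b, unsigned num_cores) {
    assert(num_cores > 0);
    tas_lock_init(&b->lock);
    b->num_cores = num_cores;
    b->arrived = 0;
    atomic_init(&b->sense, false);
}

/* local_sense is private to each core and must start out as false. */
static void sw_barrier_wait(sw_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;

    tas_lock(&b->lock);
    bool last = (++b->arrived == b->num_cores);
    if (last) {
        b->arrived = 0;  /* reset for the next barrier episode */
    }
    tas_unlock(&b->lock);

    if (last) {
        /* Release all waiting cores by publishing the new sense. */
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* Spin until the last core flips the shared sense. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) {
        }
    }
}
```

In a sketch like this, every core polls shared memory both while contending for the lock and while waiting for the sense flag, so the cost plausibly grows with the number of cores. That is the kind of overhead a hardware barrier with a fixed latency is meant to avoid.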


