Boosting SIMD Benefits through a Run-time and Energy Efficient DLP Detection

Michael Guilherme Jordan (a), Tiago Knorst (b), Julio Vicenzi (c) and Mateus Beck Rutzig (d)
Electronics and Computing Department - Federal University of Santa Maria - Santa Maria - Brazil
(a) michael.jordan@ecomp.ufsm.br
(b) tiago.knorst@ecomp.ufsm.br
(c) julio.vicenzi@ecomp.ufsm.br
(d) mateus@inf.ufsm.br

ABSTRACT


Data Level Parallelism (DLP) has improved the performance-energy tradeoff of current processors through coupled SIMD engines such as Intel AVX and ARM NEON. Special libraries and compilers are used to support DLP execution on such engines. However, hand coding inevitably adds development time, since most software developers are not skilled at extracting DLP with unfamiliar libraries. In addition, compiler-based DLP detection breaks software compatibility and is limited to static code analysis, which compromises performance gains. In this work, we propose a runtime DLP detection mechanism named Dynamic SIMD Assembler (DSA), which transparently identifies vectorizable code regions and executes them on the ARM NEON engine. Because it operates dynamically, DSA preserves software compatibility and adds no overhead to the software development process. Results show that DSA outperforms the ARM NEON auto-vectorizing compiler by 32%, since it vectorizes a wider range of code regions, such as Dynamic Range, Sentinel and Conditional Loops. In addition, DSA outperforms code hand-vectorized with the ARM library by 26% and reduces energy consumption by 45%, with no penalty on software development time.

Keywords: DLP, SIMD, Vectorization, ARM NEON.
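
For readers unfamiliar with the three loop classes named in the abstract, the following C sketch shows one plausible shape for each. It assumes the common meaning of the terms: a dynamic range loop has a trip count known only at run time, a sentinel loop runs until a marker value is found, and a conditional loop has a data-dependent branch in its body. The function names are hypothetical and are not taken from the paper.

```c
#include <stddef.h>

/* Dynamic range loop: the trip count n is only known at run time. */
void scale(int *dst, const int *src, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Sentinel loop: iterates until a marker value (here 0) is read,
   so the trip count depends on the data itself. */
size_t copy_until_zero(int *dst, const int *src) {
    size_t i = 0;
    while (src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    return i; /* number of elements copied */
}

/* Conditional loop: the body contains a data-dependent branch. */
void clamp_negatives(int *data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0)
            data[i] = 0;
    }
}
```

In each case the information needed to vectorize safely (the trip count, the sentinel position, or the branch outcome) is only available while the program runs, which makes such loops hard targets for purely static analysis.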


