Low Complexity Multi-directional In-Air Ultrasonic Gesture Recognition Using a TCN
Emad A. Ibrahim1,a, Marc Geilen1,b, Jos Huisken1,c, Min Li2 and José Pineda de Gyvez1,d
1Department of Electrical Engineering
ae.a.t.ibrahim@tue.nl
bm.c.w.geilen@tue.nl
cmin.li@nxp.com
djose.pineda.de.gyvez@nxp.com
2Department of Industrial Design, Eindhoven University of Technology, Eindhoven, Netherlands
j.a.huisken@tue.nl
ABSTRACT
Following the trend toward ultrasound-based gesture recognition, this study introduces time-sequence classification of the ultrasonic patterns that hand movements induce on a microphone array. By time-sequence ultrasound echoes we mean continuous frequency patterns received in real time at different steering angles. The ultrasound source is a single tone emitted continuously from the center of the microphone array. Meanwhile, the array beamforms and locates ultrasonic activity (induced echoes), after which a processing pipeline extracts band-limited frequency features. These beamformed features are organized in a 2D matrix of size 11 × 30 that is updated every 10 ms and on which a Temporal Convolutional Network (TCN) outputs a continuous classification. Prior to that, the same TCN is trained to classify the Doppler shift variability rate. Using this approach, we show that a user can easily perform 49 gestures at different steering angles by means of sequence detection. To keep things simple for users, we define two Doppler shift variability rates, very slow and very fast, which the TCN detects 95-99% of the time. Not only can a gesture be performed in different directions, but the length of each performed gesture can also be measured. This broadens the diversity of in-air ultrasonic gestures and allows more control capabilities. The process is designed for low-resource settings: because this real-time process is always on, its power and memory use should be optimized. The proposed solution needs 6.2–10.2 MMACs and a memory footprint of 6 KB, allowing such a gesture recognition system to be hosted by energy-constrained edge devices such as smart speakers.
Keywords: Gesture Recognition, Temporal Convolutional Networks (TCN), Human System Interaction (HSI), Edge Devices, Doppler shift
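To make the streaming setup in the abstract concrete, the following is a minimal NumPy sketch of a sliding 11 × 30 feature window feeding one causal dilated convolution, the basic building block of a TCN. It is an illustration under stated assumptions, not the network used in this work: we assume the 30 columns are consecutive 10 ms frames and the 11 rows are features, and the names `push_frame`, `causal_dilated_conv`, and `receptive_field`, the kernel size, the dilations, and the channel count are all hypothetical.

```python
import numpy as np

N_FEATURES = 11  # assumed: rows of the feature matrix
N_FRAMES = 30    # assumed: columns, 30 frames x 10 ms = 300 ms of history


def push_frame(window, frame):
    """Drop the oldest column and append the newest frame on the right."""
    return np.concatenate([window[:, 1:], frame[:, None]], axis=1)


def causal_dilated_conv(x, weights, dilation):
    """Causal dilated 1D convolution over the time axis.

    x:       (in_channels, time)
    weights: (out_channels, in_channels, kernel_size)
    The output at time t depends only on x[:, t], x[:, t - d], x[:, t - 2d], ...
    """
    out_ch, _, k = weights.shape
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))  # left-pad only, so no future frame is used
    t_len = x.shape[1]
    y = np.zeros((out_ch, t_len))
    for j in range(k):
        # tap j looks back (k - 1 - j) * dilation frames
        start = j * dilation
        y += weights[:, :, j] @ xp[:, start:start + t_len]
    return y


def receptive_field(kernel_size, dilations):
    """Frames seen by a stack with one causal convolution per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical usage: fill the window frame by frame, then apply one layer.
rng = np.random.default_rng(0)
window = np.zeros((N_FEATURES, N_FRAMES))
for _ in range(N_FRAMES):
    window = push_frame(window, rng.standard_normal(N_FEATURES))

w = rng.standard_normal((4, N_FEATURES, 3))  # 4 output channels, kernel size 3
y = causal_dilated_conv(window, w, dilation=2)
```

Under these assumptions, a stack with kernel size 3 and dilations 1, 2, 4, 8 has a receptive field of 31 frames, enough to cover the whole 30-frame window. Because the convolution pads only on the left, each output frame depends on current and past frames alone, which is what allows a classification to be emitted after every 10 ms update.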

