Modelling a fast BLAS level-1 inspired vectorized FPU for ARM devices
RAICONI, Giancarlo;
2011
Abstract
Modern DSP and multimedia algorithms often rely on linear algebra operators to perform massive numerical transformations on vectorized data. Embedded developers frequently face the worst case of having no FPU at all in their low-power systems, since many device manufacturers regard floating-point hardware as an expensive option in terms of gates and power consumption. The main aim of this work is to propose a lightweight coprocessor, designed for an ARM-based environment but easily retargetable to other architectures, capable of efficiently performing the vectorized floating-point operations described in the BLAS Level 1 specification. A distinctive feature is the coprocessor's ability to operate in a fully pipelined mode. Both single- and double-precision calculations are supported. Several CPU offloading techniques have been implemented to enable reactive power management policies during idle/waiting time slices. A VHDL implementation is presented, together with synthesis and placement results for different technologies. An FPGA+ARM9 prototype is presented and benchmarked, and the results are compared with functionally equivalent solutions running in different environments and using different sets of processing primitives (up to x86 SSE2/3/4). Finally, a complex application for Hidden Markov Model (HMM) training and evaluation is used as a test case to assess the real-world performance of the proposed approach.
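For readers unfamiliar with the BLAS Level 1 specification mentioned above, the sketch below shows a plain scalar C reference of the AXPY kernel (y := alpha*x + y), one of the vectorized floating-point primitives such a coprocessor is intended to accelerate in pipelined fashion. The function name and signature are illustrative assumptions, not taken from the thesis.

```c
#include <stddef.h>

/* Illustrative reference only (not the thesis implementation):
 * BLAS Level-1 AXPY computes y[i] = alpha * x[i] + y[i] for n elements,
 * a typical element-wise vector operation offloaded to the coprocessor. */
void saxpy_ref(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```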