Modelling a fast BLAS level-1 inspired vectorized FPU for ARM devices

RAICONI, Giancarlo;
2011-01-01

Abstract

Modern collections of DSP and multimedia algorithms often rely on linear algebra operators to perform massive numerical transformations on vectorized data. Embedded developers often face the worst case of having no FPU at all in their low-power systems, since many device manufacturers consider floating-point (FP) math an expensive option in terms of gates and power consumption. The main aim of this work is to propose a lightweight structure, designed for an ARM-based environment but easily retargetable to other architectures, capable of efficiently performing the vectorized FP operations described in the BLAS Level 1 specification. A distinctive feature of this coprocessor is its ability to work in a fully pipelined mode. Both single- and double-precision calculations are supported. Several CPU offloading techniques have been implemented to enable reactive power management policies during idle/waiting time slices. A VHDL implementation is presented, with synthesis and placement results in different technologies. An FPGA+ARM9 prototype is presented and benchmarked. Results are compared with functionally equivalent solutions running in different environments and using different sets of processing primitives (up to x86's SSE2/3/4). Finally, a complex application for Hidden Markov Model (HMM) training and evaluation is used as a test case to evaluate the real-world performance of the proposed approach.
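For readers unfamiliar with BLAS Level 1, the operations it defines are vector-vector routines such as `axpy` (scaled vector addition) and `dot` (inner product). The following is a minimal software sketch of two of them in plain C, given only to illustrate the kind of workload the abstract refers to. The function names `ref_saxpy` and `ref_ddot` are hypothetical, unit stride is assumed, and nothing here reflects the coprocessor's actual interface or the paper's implementation.

```c
#include <stddef.h>

/* Hypothetical reference routine (not from the paper):
 * single-precision axpy, y := alpha * x + y, unit stride assumed. */
static void ref_saxpy(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

/* Hypothetical reference routine (not from the paper):
 * double-precision dot product, returns sum of x[i] * y[i]. */
static double ref_ddot(size_t n, const double *x, const double *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}
```

On a core without an FPU, each multiply and add in these loops would fall back to software floating-point emulation, which is the cost a vectorized FP coprocessor is meant to remove.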
2011
9781612848563

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/3048517