A GPU Accelerated DCT Implementation for Image Compression
Cardone, Angelamaria; Di Pascale, Gerardo
2026
Abstract
The Discrete Cosine Transform (DCT) is a cornerstone of the JPEG standard, but its direct implementation is computationally expensive for full-image processing because of the intensive matrix operations involved. Building on the methodology proposed by Haweel et al. in 2016, which uses a specific transformation matrix to streamline block-based DCT through matrix multiplication, this work proposes an optimized parallel version designed for GPUs. The implementation relies on memory coalescing, reduced thread divergence, and efficient management of the GPU memory hierarchy. Experimental results show that the optimized algorithm achieves a substantial speed-up over traditional CPU-based approaches. The implementation also outperforms some existing parallel GPU solutions, reducing execution time relative to both an efficient CUDA-based algorithm and the standard cuBLAS library. These performance gains are most pronounced on high-resolution images, which highlights the scalability and computational efficiency of the proposed approach for large-scale visual data.
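To make the idea of block-based DCT through matrix multiplication concrete, the following is a minimal CPU sketch in plain Python. It is an illustration under stated assumptions, not the authors' GPU implementation: it uses the standard orthonormal DCT-II matrix rather than the transformation matrix of Haweel et al. (which the abstract does not specify), it assumes image dimensions are multiples of the block size, and the function names `dct_matrix`, `block_dct`, and `image_dct` are hypothetical.

```python
import math

N = 8  # JPEG block size


def dct_matrix(n=N):
    """Standard orthonormal DCT-II matrix C (assumed here for illustration)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def block_dct(block, c=None):
    """2D DCT of one square block as two matrix products: C * X * C^T."""
    if c is None:
        c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def image_dct(image, n=N):
    """Apply block_dct to every n x n tile; tiles are independent of each other.

    Assumes the image height and width are multiples of n.
    """
    height, width = len(image), len(image[0])
    c = dct_matrix(n)
    out = [[0.0] * width for _ in range(height)]
    for by in range(0, height, n):
        for bx in range(0, width, n):
            tile = [row[bx:bx + n] for row in image[by:by + n]]
            coeffs = block_dct(tile, c)
            for i in range(n):
                out[by + i][bx:bx + n] = coeffs[i]
    return out
```

Because each tile is transformed independently, the two loops over `by` and `bx` are the natural unit of parallelism; a GPU version would typically assign tiles to thread blocks, although the specific mapping used in the paper is not described in the abstract.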


