VLSI Implementation of AlphaEvolve Based Rank-48 Algorithm for 4×4 Real-Valued Matrix Multiplication
Napoli, Ettore
2026
Abstract
Matrix multiplication circuits are widely used as accelerators in 3D graphics, communications, artificial intelligence, and other domains. Recent years have seen significant advances in efficient algorithms for small-dimension matrix multiplication. The AlphaEvolve algorithm for 4×4 complex-valued matrices achieves rank 48, using only 48 binary multiplications, which improves on the naive algorithm (64 multiplications) and on the previous best-known methods (49 multiplications). Following this development, a new algorithm for real-valued 4×4 matrix multiplication was proposed in June 2025. This work presents the first hardware implementation of that algorithm, benchmarked against the naive algorithm commonly employed in commercial accelerators. Synthesized in a 14 nm FinFET standard cell library, the proposed circuit requires fewer hardware resources than the naive implementation for input bit-widths above 40 bits. While the naive implementation achieves higher speed and lower power, the performance gap narrows as operand size increases, highlighting the potential of the proposed approach for high-precision applications.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


