Exploring NCCL Tuning Strategies for Distributed Deep Learning

SalimiBeni M.; Cosenza B.
2025

Abstract

The communication overhead in distributed deep learning caused by the synchronization of model parameters across multiple devices can significantly impact training time. Although powerful GPU-to-GPU communication libraries such as NCCL are available, their default configurations are not always well adapted to varying hardware and workloads, which can result in lower performance. In this paper, we explore the tuning potential of NCCL and present an approach to tuning its parameters for distributed deep learning workloads. We identify efficient parameter configurations through an optimization process that explores the solution space defined by performance-related NCCL parameters. Experimental results on the Leonardo supercomputer, utilizing up to 64 GPUs, show significant performance improvements across micro-benchmarks and three deep learning models. In micro-benchmarks, we improved the bandwidth of ncclAllReduce and ncclAllGather by 112× and 36×, respectively. The tuned NCCL parameter configurations reduced the training time of the models by up to 12.5%.
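To illustrate the kind of parameter exploration the abstract describes, the following Python script is a minimal, hypothetical sketch rather than the tuner used in the paper: it sweeps three documented, performance-related NCCL environment variables (NCCL_ALGO, NCCL_PROTO, NCCL_MIN_NCHANNELS) and ranks the resulting configurations by the average bus bandwidth reported by the standard nccl-tests all_reduce_perf benchmark. The chosen variables, their candidate values, the benchmark flags, and the two-GPU setting are illustrative assumptions, and the all_reduce_perf binary is assumed to be built and on the PATH.

import itertools
import os
import re
import subprocess

# Illustrative search space over a few documented NCCL knobs (not the paper's).
SEARCH_SPACE = {
    "NCCL_ALGO": ["Ring", "Tree"],
    "NCCL_PROTO": ["LL", "LL128", "Simple"],
    "NCCL_MIN_NCHANNELS": ["2", "8", "16"],
}

def run_benchmark(overrides, gpus=2):
    """Launch nccl-tests' all_reduce_perf once with the given NCCL settings
    and return the average bus bandwidth (GB/s) it reports."""
    env = {**os.environ, **overrides}
    out = subprocess.run(
        ["all_reduce_perf", "-b", "8M", "-e", "256M", "-f", "2", "-g", str(gpus)],
        env=env, capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", out)
    return float(match.group(1)) if match else 0.0

if __name__ == "__main__":
    results = []
    for values in itertools.product(*SEARCH_SPACE.values()):
        config = dict(zip(SEARCH_SPACE.keys(), values))
        results.append((run_benchmark(config), config))
    # Report the three best configurations found by the sweep.
    for bandwidth, config in sorted(results, key=lambda r: r[0], reverse=True)[:3]:
        print(f"{bandwidth:7.2f} GB/s  {config}")

Each configuration is evaluated in a fresh subprocess because NCCL reads these environment variables when its communicators are initialized, so they cannot be changed reliably within an already running job.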


Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4942020