An Analysis of Performance Variability on Dragonfly+ Topology

Salimibeni, Majid; Cosenza, Biagio
2022-01-01

Abstract

Large-scale compute clusters are highly affected by performance variability that originates from different sources. Among these sources, the network plays an essential role because it is a resource shared among users and their jobs in a supercomputer. In this paper, we analyze the effect of several network-related sources on the performance variability of a modern compute cluster equipped with a Dragonfly+ interconnect. Specifically, we focus on the impact of job placement, communication patterns, routing strategy, and network background traffic on the performance variability of communication-intensive workloads. To quantify the effect of network congestion (background traffic) on performance variability, we propose a heuristic that successfully estimates the amount of communication on the network produced by other jobs running on the cluster simultaneously. Then, we show how this network congestion contributes to the performance variability of different communication patterns and real-world communication-intensive applications.
2022
ISBN: 978-1-6654-9856-2
Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4835651