Large-scale compute clusters are highly affected by performance variability that originates from different sources. Among these sources, the network plays an essential role as a shared resource between users and their jobs in a supercomputer. In this paper, we analyze the effect of some network-related sources on the performance variability of a modern compute cluster equipped with a Dragonfly+ interconnect. Specifically, we focus on the impacts of job placement, communication patterns, routing strategy, and network background traffic on the performance variability of communication-intensive workloads.To quantify the effect of network congestion (background traffic) on the performance variability, we propose a heuristic that can successfully estimate the amount of communication on the network produced by other jobs running on the cluster simultaneously. Then, we show how this network congestion contributes to the performance variability of different communication patterns and real-world communication-intensive applications.
File in questo prodotto:
Non ci sono file associati a questo prodotto.