
Microservices Performance Testing with Causality-enhanced Large Language Models

Mascia C.; Guerriero A.
2025

Abstract

Efficient performance testing of microservices is essential for engineers to ensure that deviations of performance and resource-usage metrics from expectations are promptly identified within their rapid release cycles. To this aim, engineers need to explore the space of possible workload configurations and focus only on the critical ones, e.g., low-load configurations that unexpectedly cause performance issues. This requires significant effort and can be infeasible in short release cycles. We present CALLMIT, a framework using Large Language Models (LLMs) enhanced by causal reasoning to automatically generate critical workloads for microservices performance testing. Engineers query CALLMIT to generate workload configurations expected to expose deviations from performance requirements, so as to run only the tests that trigger critical configurations. We present an experimental evaluation on three subjects, with a comparison to a conventional Retrieval-Augmented Generation technique. The results show that causal models improve the LLM's correct identification of performance-critical workload configurations.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4918560
