Don’t push the button! Exploring data leakage risks in machine learning and transfer learning

Apicella, Andrea
2025

Abstract

Machine Learning (ML) has revolutionized various domains, offering powerful predictive capabilities across several areas. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use of ML may be the increasing availability of machine learning tools, leading to what we call the “push the button” approach. While this approach offers convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. In particular, this paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Indeed, crucial steps in the ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. This paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in the Transfer Learning framework, and compares the standard inductive ML and transductive ML paradigms. The conclusion summarizes the key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications that account for tasks and generalization goals.
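To make the kind of leakage described in the abstract concrete, here is a minimal illustrative sketch (not taken from the paper): a preprocessing step fitted on the full dataset before the train/test split quietly carries test-set statistics into training, whereas the correct pipeline fits the preprocessor on the training split only. The dataset, model, and parameters below are arbitrary choices for the demo.

```python
# Minimal sketch of preprocessing leakage (illustrative, not from the paper).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky pipeline: normalization statistics are computed on ALL data,
# so information from the future test set contaminates training.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Correct pipeline: split first, then fit the scaler on the training
# split only and apply the same transform to the held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clean_acc = (LogisticRegression()
             .fit(scaler.transform(X_tr), y_tr)
             .score(scaler.transform(X_te), y_te))

print(f"leaky pipeline accuracy:   {leaky_acc:.3f}")
print(f"correct pipeline accuracy: {clean_acc:.3f}")
```

On i.i.d. toy data the accuracy gap may be negligible, but the structural mistake is the same one that inflates reported performance on real, non-i.i.d. data.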
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4918802
Warning: the data displayed have not been validated by the university.

Citations
  • PMC: N/A
  • Scopus: 3
  • Web of Science (ISI): 1