Predicting Source Code Vulnerabilities Using Deep Learning: A Fair Comparison on Real Data

Carletti V.; Foggia P.; Saggese A.; Vento M.
2024

Abstract

In software development, detecting vulnerabilities in source code is a paramount concern, especially for languages such as C and C++ that are widely used in mission-critical applications, operating systems, and embedded software. Traditional approaches often struggle because they rely on hand-crafted rules and pattern matching, which can lead to high false-positive rates and demand considerable effort from human experts. In addition, evolving software development practices and increasingly sophisticated cyber threats constantly challenge traditional systems, making them less useful over time. In this paper, we explore the effectiveness of state-of-the-art deep learning methods in identifying vulnerabilities in the C/C++ source code of real-world software projects. We conduct a comprehensive analysis comparing basic deep learning methods used for text processing against more advanced architectures, including Transformers and Graph Neural Networks (GNNs), with the aim of providing a reliable benchmark for evaluating vulnerability detection approaches. To this end, we have prepared a large dataset by combining and normalizing data from several publicly available code datasets extracted from well-known open-source software projects, namely Big-Vul, DiverseVul, Devign, and ReVeal. The results provide insights into the complexity of the task when it is tackled in a realistic setup, and suggest challenges and promising research directions for applying the most recent deep learning models.
Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4877693